In my django project I have the following dependencies:
django==1.5.4
django-celery==3.1.9
amqp==1.4.3
kombu==3.0.14
librabbitmq==1.0.3 (as suggested by https://stackoverflow.com/a/17541942/1452356)
In dev_settings.py:
DEBUG = False
BROKER_URL = "django://"
import djcelery
djcelery.setup_loader()
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
CELERYD_CONCURRENCY = 2
# CELERYD_TASK_TIME_LIMIT = 10
CELERYD_TASK_TIME_LIMIT is commented as suggested here https://stackoverflow.com/a/17561747/1452356 along with debug_toolbar as suggested by https://stackoverflow.com/a/19931261/1452356
I start my worker in a shell with:
./manage.py celeryd --settings=dev_settings
Then I send a task:
class ExempleTask(Task):
    def run(self, piProjectId):
        table = []
        for i in range(50000000):
            table.append(1)
        return None
Using a django command:
class Command(BaseCommand):
    def handle(self, *plArgs, **pdKwargs):
        loResult = ExempleTask.delay(1)
        loResult.get()
        return None
With:
./manage.py purge_and_delete_test --settings=dev_settings
I monitor the memory usage with:
watch -n 1 'ps ax -o rss,user,command | sort -nr | grep celery |head -n 5'
Every time I call the task, it increases the memory consumption of the celeryd/worker process, proportionally to the amount of data allocated in it...
It seems like a common issue (cf. the other Stack Overflow links), but I couldn't fix it, even with the latest dependencies.
Thanks.
This is a Python and OS issue, not really a django or celery issue. Without getting too deep:
1) A process will never free memory addressing space once it has requested it from the OS. It never says "hey, I'm done here, you can have it back". In the example you've given, I'd expect the process size to grow for a while, and then stabilize, possibly at a high base line. After your example allocation, you might call the gc interface to force a garbage collect to see how
2) This isn't usually a problem, because unused pages are paged out by the OS because your process stops accessing that address space that it has deallocated.
3) It is a problem if your process is leaking object references, preventing python from garbage collecting to re-appropriate the space for later reuse by that process, and requiring your process to ask for more address space from the OS. At some point, the OS cries uncle and will (probably) kill your process with its oomkiller or similar mechanism.
4) If you are leaking, either fix the leak or set CELERYD_MAX_TASKS_PER_CHILD, and your child processes will (probably) commit suicide before upsetting the OS.
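To illustrate point 3, here is a minimal sketch of how a lingering reference keeps task data alive. The Payload class and the module-level _leaked list are hypothetical stand-ins, not part of Celery; the check relies only on CPython's gc and weakref modules.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for data a task allocates."""

    def __init__(self, size):
        self.data = [1] * size


# Hypothetical module-level cache: anything appended here outlives the task.
_leaked = []


def run_task(leak):
    """Allocate a payload; return True if it was reclaimed afterwards."""
    payload = Payload(1000)
    ref = weakref.ref(payload)
    if leak:
        _leaked.append(payload)  # the leaked reference
    del payload
    gc.collect()  # force a collection, as suggested in point 1
    return ref() is None


print(run_task(leak=False))  # payload can be reclaimed
print(run_task(leak=True))   # payload is still referenced by _leaked
```

If the leak cannot be found, the recycling option from point 4 is a single line in the settings module, for example CELERYD_MAX_TASKS_PER_CHILD = 100 (the value 100 is an arbitrary example).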
This is a good general discussion on Python's memory management:
CPython memory allocation
And a few minor things:
Use xrange rather than range (on Python 2): range builds the full list of values and then iterates over it, whereas xrange produces the values lazily. Have you set Django DEBUG=False?
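As a rough illustration of the range/xrange point: on Python 3, range behaves like Python 2's xrange, so the lazy object stays small while a materialized list grows with the number of elements. This sketch assumes Python 3; sys.getsizeof counts only the container itself, not the integers it refers to.

```python
import sys

N = 1000000

lazy = range(N)         # Python 3 range: computes values on demand
eager = list(range(N))  # what Python 2's range() would have built

lazy_size = sys.getsizeof(lazy)
eager_size = sys.getsizeof(eager)

print("lazy range:", lazy_size, "bytes")
print("full list: ", eager_size, "bytes")
```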
I am following the tutorial on
Center for High Throughput Computing and Introduction to Configuration in the HTCondor website to set up a Partitionable slot. Before any configuration I run
condor_status
and get the following output.
I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
and reconfigure
sudo condor_reconfig
Now with
condor_status
I get this output as expected. Now, I run the following command to check everything is fine
condor_status -af Name Slotype Cpus
and find slot1#ip-172-31-54-214.ec2.internal undefined 1 instead of slot1#ip-172-31-54-214.ec2.internal Partitionable 4 61295, which is what I would expect. Moreover, when I try to submit a job that asks for more than 1 CPU, it does not allocate space for it as it should (it stays waiting forever).
I don't know if I made some mistake during the installation process or what could be happening. I would really appreciate any help!
EXTRA INFO: In case it helps, I installed HTCondor with the command
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run
on Ubuntu 18.04 running on an old p2.xlarge instance (it has 4 cores).
UPDATE: After rebooting the whole thing it seems to be working. I can now send jobs with different CPUs requests and it will start them properly.
The only issue I would say persists is that Memory allocation is not showing properly, for example:
But in reality it is allocating enough memory for the job (in this case around 12 GB).
If I run again
condor_status -af Name Slotype Cpus
I still get something I am not supposed to
But at least it is showing the correct number of CPUs (even if it just says undefined).
What is the output of condor_q -better when the job is idle?
I have a program written by someone else that uses OpenMP. I am running it on a cluster that uses Slurm as its job manager. Despite setting OMP_NUM_THREADS=72 and properly requesting 72 cores for the job, the job is only using four cores.
I have already used scontrol show job <job_id> --details to verify that there are 72 cores assigned to the job. I have also logged into the node that the job is running on and used htop to inspect it. It was running 72 threads, all on four cores. It is worth noting that this is an SMT4 POWER9 CPU, meaning that each physical core executes 4 simultaneous threads. Ultimately, it looks like OpenMP is putting all threads on one physical core. This is further complicated by the fact that this is an IBM system. I can't seem to find any useful documentation on finer control of the OpenMP environment. Everything I find is for Intel.
I have also tried using taskset to manually change the affinity. This worked as intended and moved one of the threads to an unused core. The program continued to work as intended after this.
I could theoretically write a script to find all of the threads and call taskset to assign them to cores in a logical way, but I am afraid to do this. It seems like a bad idea to me. It would also take a while.
I guess my main question is: is this a Slurm problem, an OpenMP problem, an IBM problem or a user error? Is there some environment variable I don't know about that I need to set? Will it break Slurm if I manually call taskset from a script? If I did that, I would use scontrol to figure out which CPUs are assigned to the job. I don't want to anger the people who run the cluster by messing things up, though.
Here is the submission script. I can't include any of the actual running code due to license issues though. I'm hoping this will just be a simple matter of fixing an environment variable. The MPI_OPTIONS variables were recommended by the guy who administers the system. If by some chance someone here has worked with the ENKI cluster before, that's where this is running.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module load openmpi/3.1.3/2019
module load pgi/2019
export OMP_NUM_THREADS=72
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
Edit: Fix resulted in a 7x speedup when using 72 cores, vs. just running on 4 cores. Considering the nature of the calculations being run, this is pretty good.
Edit 2: Fix resulted in a 17x speedup when using 160 vs. just running on 4 cores.
This might not work for everyone, but I have a really hacky solution. I wrote a python script that uses psutil to find all threads that are children of the running process and set their affinity manually. This script uses scontrol to figure out which cpus are assigned to the job and uses taskset to force the threads to distribute across those cpus.
So far the process is running a lot faster. I'm sure that forcing CPU affinity isn't the best way to do it, but it's a lot better than not using the available resources at all.
Here is the basic idea behind the code. The program I am running is called pgmc, hence the variable names. You will need to create an anaconda environment with psutil installed if you are running on a system like mine.
import psutil
import subprocess
import os
import sys
import time

# Gets the id for the current job.
def get_job_id():
    return os.environ["SLURM_JOB_ID"]

# Returns a list of processors assigned to the job and the total number of cpus
# assigned to the job.
def get_proc_info():
    run_str = 'scontrol show job %s --details'%get_job_id()
    stdout = subprocess.getoutput(run_str)
    id_spec = None
    num_cpus = None
    chunks = stdout.split(' ')
    for chunk in chunks:
        if chunk.lower().startswith("cpu_ids"):
            # Assumes a single "start-stop" range, e.g. CPU_IDs=0-71.
            id_spec = chunk.split('=')[1]
            start, stop = id_spec.split('-')
            id_spec = list(range(int(start), int(stop) + 1))
        if chunk.lower().startswith('numcpus'):
            num_cpus = int(chunk.split('=')[1])
    if id_spec is not None and num_cpus is not None:
        return id_spec, num_cpus
    raise Exception("Couldn't find information about the allocated cpus.")

if __name__ == '__main__':
    # Before we do anything, make sure that we can get the list of cpus
    # assigned to the job. Once we have that, run the command line supplied.
    cpus, cpu_count = get_proc_info()
    if len(cpus) != cpu_count:
        raise Exception("CPU list didn't match CPU count.")
    # If we successfully got to here, run the command line.
    program_name = ' '.join(sys.argv[1:])
    pgmc = subprocess.Popen(sys.argv[1:])
    # Give the program time to start its threads before looking for them.
    time.sleep(10)
    pid = [proc for proc in psutil.process_iter() if proc.name() == "your program name here"][0].pid
    # Now that we have the pid of the pgmc process, we need to get all
    # child threads of the process.
    pgmc_proc = psutil.Process(pid)
    pgmc_threads = list(pgmc_proc.threads())
    # Now that we have a list of threads, we loop over available cores and
    # assign threads to them. Once this is done, we wait for the process
    # to complete.
    while len(pgmc_threads) != 0:
        for core_id in cpus:
            if len(pgmc_threads) != 0:
                thread_id = pgmc_threads[-1].id
                pgmc_threads.remove(pgmc_threads[-1])
                taskset_string = 'taskset -cp %i %i'%(core_id, thread_id)
                print(taskset_string)
                subprocess.getoutput(taskset_string)
            else:
                break
    # All of the threads should now be assigned to a core.
    # Wait for the process to exit.
    pgmc.wait()
    print("program terminated, exiting . . . ")
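The parsing in get_proc_info assumes CPU_IDs is a single start-stop range. As a hedged sketch, assuming the field can also be a comma-separated mix of single ids and ranges (for example 0-3,8,10-11), the first helper below expands such a spec and the second spreads thread ids over the cpus round-robin. All three function names are hypothetical; os.sched_setaffinity is Linux-only and is used here in place of the taskset subprocess call.

```python
import os


def parse_cpu_ids(spec):
    """Expand a spec such as '0-3,8,10-11' into a sorted list of cpu ids."""
    cpus = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            start, stop = part.split('-')
            cpus.update(range(int(start), int(stop) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def round_robin(thread_ids, cpus):
    """Map each thread id to one cpu, cycling through the cpu list."""
    return {tid: cpus[i % len(cpus)] for i, tid in enumerate(thread_ids)}


def pin_threads(assignment):
    """Apply the mapping (Linux only); each thread is pinned to one cpu."""
    for tid, cpu in assignment.items():
        os.sched_setaffinity(tid, {cpu})


if __name__ == '__main__':
    cpus = parse_cpu_ids('0-3,8,10-11')
    print(cpus)
    print(round_robin([101, 102, 103], cpus[:2]))
```

Keeping the mapping separate from pin_threads makes it possible to print and check the planned assignment before any affinity is changed.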
Here is the submission script used.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module purge
module load openmpi/3.1.3/2019
module load pgi/2019
module load anaconda3
# This is the anaconda environment I created with psutil installed.
conda activate psutil-node
export OMP_NUM_THREADS=72
# The two MPI_OPTIONS lines are specific to this cluster if I'm not mistaken.
# You probably won't need them.
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time python3 affinity_set.py mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
My main reason for including the submission script is to demonstrate how the python script is used. More specifically, you call it, with your real job as an argument.
In essence, the following function, called by the user of the django application that I am developing, uses the Scapy library to process 80-odd fairly large pcaps in order to initially parse their destination IP addresses.
I was wondering whether it would be possible to process several pcaps simultaneously, as the CPU is not being utilised to its full capacity, ideally using multi-threading.
def analyseall(request):
    allpcaps = Pcaps.objects.all()
    for individualpcap in allpcaps:
        strfilename = str(individualpcap.filename)
        print(strfilename)
        pcapuuid = individualpcap.uuid
        print(pcapuuid)
        packets = rdpcap(strfilename)
        print("hokay")
        for packet in packets:
            if packet.haslayer(IP):
                # print(packet[IP].src)
                # print(packet[IP].dst)
                dstofpacket = packet[IP].dst
                PcapsIps.objects.update_or_create(ip=dstofpacket, uuid=individualpcap)
    return render(request, 'about.html', {"list": list})
You can use the above answer (multiprocessing) and also improve Scapy's reading speed by using the PcapReader generator rather than rdpcap, which loads the whole file into memory:
with PcapReader(filename) as fdesc:
    for pkt in fdesc:
        [actions on the pkt]
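A minimal sketch of combining both suggestions, assuming Scapy is installed and that each worker only reads a file and returns its set of destination IPs. The function names, the max_workers default and the executor_cls parameter are hypothetical choices for illustration, not part of Scapy or Django.

```python
from concurrent.futures import ProcessPoolExecutor


def destinations_in_pcap(filename):
    """Return (filename, set of destination IPs) for one capture file."""
    # Imported lazily so this module can be loaded without Scapy installed.
    from scapy.all import IP, PcapReader

    destinations = set()
    with PcapReader(filename) as reader:
        for packet in reader:
            if packet.haslayer(IP):
                destinations.add(packet[IP].dst)
    return filename, destinations


def analyse_in_parallel(filenames, worker=destinations_in_pcap,
                        max_workers=4, executor_cls=ProcessPoolExecutor):
    """Run worker over all files concurrently; return {filename: result}."""
    # executor_cls can be swapped (e.g. for a thread pool) when testing.
    with executor_cls(max_workers=max_workers) as pool:
        return dict(pool.map(worker, filenames))
```

In the view, the parent process would then loop over the returned dictionary and call update_or_create once per unique IP, so the worker processes never touch the Django database connection.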
I consider mixing multiprocessing and Django tricky. I worked on such a solution once and finally decided to use Celery and RabbitMQ.
Using Celery, you can easily define a task that processes a single pcap. Then you can start a few independent workers to process files in the background. This results in a slightly more complicated architecture (you need to provide a message queue, e.g. RabbitMQ, and the Celery workers), but you gain much simpler code.
http://docs.celeryproject.org/en/latest/django/first-steps-with-django.html
In my case Celery saved a lot of time.
You can also check this question and answers:
How to use python multiprocessing module in django view
Kinda a Celery noob here, but I think I have a configuration issue where Celery is putting too much stuff in Redis.
My goal is to reduce or optimize the amount of memory Redis is using, if I can.
I have a pretty large Django production setup where Celery jobs are run "a lot". In my settings.py I have
BROKER_BACKEND = "redis"
From a top -p13907 Redis is using a ton of memory (on the box it's only used by Celery):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13907 redis 20 0 10.5g 3.3g 532 S 0 42.8 109:38.94 redis-server
I found this CELERY_TASK_RESULT_EXPIRES setting which looks like something I should add to my settings file.
By default, from the documentation it looks like it's set to 1 day (86400 seconds)
Is this what I wanna change? Or are there more settings I should look into? Another thing I'm unsure about: if I add it, how should I go about deciding what's a "safe" number of seconds to set it to?
I guess your Celery caller forgets to clean up its results, so they stay in the message queue server until they expire. In Celery, you have to call
r.get()
to fetch the result and clear it from the message queue. If you only access the result without calling this function:
r.result
the result is still held by the message queue server and consumes a lot of your memory!
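A hedged sketch of the settings touched on in this thread, using the Celery 3.x setting names (newer releases use different, lowercase names). The one-hour value is an arbitrary example, not a recommendation; a reasonable value depends on how long callers may wait before reading a result.

```python
# settings.py (Celery 3.x style names)

# Keep stored task results for one hour instead of the default one day.
CELERY_TASK_RESULT_EXPIRES = 60 * 60  # seconds

# Set this to True if no caller ever reads task results, so that
# results are not stored at all.
CELERY_IGNORE_RESULT = False
```

Individual tasks can also opt out of result storage by passing ignore_result=True to the task decorator.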
Common situation: I have a client on my server who may update some of the code in his python project. He can ssh into his shell and pull from his repository and all is fine -- but the code is stored in memory (as far as I know) so I need to actually kill the fastcgi process and restart it to have the code change.
I know I can gracefully restart fcgi but I don't want to have to manually do this. I want my client to update the code, and within 5 minutes or whatever, to have the new code running under the fcgi process.
Thanks
First off, if uptime is important to you, I'd suggest making the client do it. It can be as simple as giving him a command called deploy-code. With your method, an error in his code means a 10-minute turnaround (read: downtime) to fix it, assuming he gets it right the second time.
That said, if you actually want to do this, you should create a daemon which will look for files modified within the last 5 minutes. If it detects one, it will execute the reboot command.
Code might look something like:
import os, time

CODE_DIR = '/tmp/foo'
restarted = False

while True:
    # Reset the flag at the start of every polling pass.
    if restarted:
        restarted = False
    time.sleep(5 * 60)
    for root, dirs, files in os.walk(CODE_DIR):
        if restarted:
            break
        for filename in files:
            if restarted:
                break
            updated_on = os.path.getmtime(os.path.join(root, filename))
            current_time = time.time()
            if current_time - updated_on <= 6 * 60:  # 6 min
                # A 6 min window with a 5 min poll can report the same
                # change twice (false positive), but that's better than
                # missing a change (false negative).
                restarted = True
                print("We should execute the restart command here.")
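The detection step of the polling loop above can also be factored into a function that is easy to test on its own. This is a minimal sketch: the names recently_modified and watch are hypothetical, and restart is any callable supplied by the caller (for example one that runs the site's own fcgi restart command, which is left unspecified here).

```python
import os
import time


def recently_modified(code_dir, window_seconds, now=None):
    """Return True if any file under code_dir changed within the window."""
    if now is None:
        now = time.time()
    for root, _dirs, files in os.walk(code_dir):
        for filename in files:
            path = os.path.join(root, filename)
            try:
                modified = os.path.getmtime(path)
            except OSError:
                continue  # file vanished between the walk and the stat
            if now - modified <= window_seconds:
                return True
    return False


def watch(code_dir, restart, poll_seconds=5 * 60, window_seconds=6 * 60):
    """Poll forever, calling restart() whenever a recent change is seen."""
    while True:
        time.sleep(poll_seconds)
        if recently_modified(code_dir, window_seconds):
            restart()
```

Returning on the first recent file replaces the restarted flag and the nested break statements.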