I have a genetic algorithm which I would like to speed up. I think the easiest way to achieve this is with Python's multiprocessing module. After running cProfile on my GA, I found that most of the computation time is spent in the evaluation function.
def evaluation():
    scores = []
    for chromosome in population:
        scores.append(costly_function(chromosome))
How would I go about parallelizing this method? It is important that the scores are appended in the same order as they would be if the program ran sequentially.
I'm using Python 2.7.

Use a Pool. I show both imap and map because some results on Google say map may not preserve ordering, though I have yet to see proof (both are documented as returning results in input order; only imap_unordered does not):
from multiprocessing import Pool
def evaluation(population):
    return list(Pool(processes=nprocs).imap(costly_function, population))
or (what I use):
    return Pool(processes=nprocs).map(costly_function, population)
Set nprocs to the number of parallel processes you want.
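A slightly fuller sketch of the same idea, for illustration only. Here costly_function is a hypothetical stand-in for the real fitness function (it just sums the genes), and nprocs=2 is an arbitrary default. The pool is closed explicitly so that worker processes do not linger, and Pool.map returns the scores in the same order as the population.

```python
from multiprocessing import Pool


def costly_function(chromosome):
    # Hypothetical stand-in for the real fitness function: sum of the genes.
    return sum(chromosome)


def evaluation(population, nprocs=2):
    pool = Pool(processes=nprocs)
    try:
        # map() blocks until all results are in and keeps the input order.
        return pool.map(costly_function, population)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    population = [[1, 2, 3], [4, 5, 6], [0, 0, 1]]
    scores = evaluation(population)
    print(scores)  # prints [6, 15, 1]
```

The __main__ guard matters on platforms where workers are started by re-importing the module (Windows, for example); without it each worker would try to create its own pool.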


Pathos multiprocessing pool hangs

I'm trying to use multiprocessing inside a Docker container. However, I'm facing two issues.
(I'm using Python 2.7.)
1. Creating ProcessingPool()/Pool() (I tried both) takes an abnormally long time, maybe over a minute or two.
2. After it processes the function, it hangs.
I'm basically trying to run a very simple case inside my container. Here's what I have:
from pathos.multiprocessing import ProcessingPool
import multiprocessing
class MultiprocessClassExample():
    def worker(self, number):
        return "Printing number %s" % (number)
    def generateNumber(self):
        PROCESSES = multiprocessing.cpu_count() - 1
        NUMBER = ['One', 'Two', 'Three', 'Four', 'Five']
        result = ProcessingPool(PROCESSES).map(self.worker, NUMBER)
        print("Finished processing.")
and I call using the following code.
Now, this seems straightforward enough. I ran this in a Jupyter notebook and it ran without an issue. I also tried starting Python inside my Docker container and running the above code there, and it went fine. So I'm assuming it has to do with the complete code that I have. Obviously I didn't write out all the code, but that's the main section of the code I'm trying to handle right now.
I would expect the above code to work as well. However, the first thing I notice is that when I call ProcessingPool(), it takes a long time. I tried the regular multiprocessing.Pool() before and saw the same effect, whereas in the notebook it ran quickly and smoothly.
After waiting several minutes, it prints:
Printing number One
Printing number Two
Printing number Three
Printing number Four
Printing number Five
and that's it. It never prints "Finished processing." and just hangs there.
But when the print statements appear, I notice that several debug messages appear at the same time:
[WARNING] Worker graceful timeout
[INFO] Worker exiting
[INFO] Booting worker with pid:
Any suggestions would be greatly appreciated.

PyTorch DataLoader gets stuck when using the OpenCV resize method

I can run all the cells of the PyTorch tutorial notebook about data loading (pytorch tutorial).
But when I use OpenCV in place of skimage to resize the image, the dataloader gets stuck, i.e. nothing happens.
In the Rescale class:
class Rescale(object):
    def __call__(self, sample):
        #img = transform.resize(image, (new_h, new_w))
        img = cv2.resize(image, (new_h, new_w))
The dataloader and the for loop are defined with:
dataloader = DataLoader(transformed_dataset, batch_size=4,
                        shuffle=True, num_workers=4)
for i_batch, sample_batched in enumerate(dataloader):
    print(i_batch, sample_batched['image'].size(),
I can get the iterator to print something if num_workers=0. It looks like OpenCV does not play well with PyTorch's multiprocessing.
I would really prefer to use the same package to transform the images at train time and test time (and I am already using OpenCV for the image rescale at test time).
Any suggestions would be greatly appreciated.
I had a very similar problem, and this is how I solved it:
right after importing cv2, call cv2.setNumThreads(0),
and then you can set num_workers>0 in the PyTorch dataloader.
It seems that OpenCV tries to multithread and something ends up in a deadlock.
Hope it helps.
Besides cv2.setNumThreads(0),
import multiprocessing
can also solve this problem.
From the PyTorch issues, these five may also help (not recommended):
1. time.sleep(0.003)
2. pin_memory = True/False
3. num_workers = 0/1
4. from import DataLoader
5. writing 8192 to /proc/sys/kernel/shmmni
OpenCV and PyTorch multiprocessing sometimes don't play well together. When running code in which OpenCV function calls are embedded in a homebrewed function parallelized in a "multiprocessing" pool, the code eventually ends up with idle processors after several calls to the pool; how many calls it takes can fluctuate from run to run.
Forking could be the problem: forking only clones the current thread. In this case, the thread pool may wrongly assume that it has more threads than it actually has.
The code then waits forever for a condition that is never signaled, because the number of threads is not as expected after forking.
Spawning gives each worker fresh memory instead of a copy of the parent's, forcing a re-initialization of all data structures (such as the thread pool) within OpenCV.
More discussion can be found in the OpenCV and PyTorch issues on GitHub.
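To illustrate the fork-versus-spawn point, here is a minimal sketch that uses only the standard library (Python 3.4+, where multiprocessing.get_context is available). The names heavy_transform and run_with_start_method are made up for this example, and heavy_transform merely stands in for a function that would call into OpenCV. With "spawn", each worker starts from a fresh interpreter, so any thread pool inside a library is initialized in the worker itself rather than inherited from the parent.

```python
import multiprocessing as mp


def heavy_transform(size):
    # Hypothetical stand-in for an OpenCV-based transform (e.g. a resize).
    height, width = size
    return height * width


def run_with_start_method(sizes, method="spawn", processes=2):
    # get_context() returns a context bound to the chosen start method
    # without changing the process-wide default.
    ctx = mp.get_context(method)
    pool = ctx.Pool(processes)
    try:
        return pool.map(heavy_transform, sizes)
    finally:
        pool.close()
        pool.join()
```

Call it from under an if __name__ == "__main__": guard, for example run_with_start_method([(2, 3), (4, 5)]). Spawned workers re-import the main module, so the call must not run at import time.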

Process Several Pcap Files Simultaneously - Django

In essence, the following function, called by the user of the Django application that I am developing, uses the Scapy library to process 80-odd fairly large pcaps in order to initially parse their destination IP addresses.
I was wondering whether it would be possible to process several pcaps simultaneously, ideally using multi-threading, as the CPU is not being utilised to its full capacity.
def analyseall(request):
    allpcaps = Pcaps.objects.all()
    for individualpcap in allpcaps:
        strfilename = str(individualpcap.filename)
        pcapuuid = individualpcap.uuid
        packets = rdpcap(strfilename)
        for packet in packets:
            if packet.haslayer(IP):
                # print(packet[IP].src)
                # print(packet[IP].dst)
                dstofpacket = packet[IP].dst
                PcapsIps.objects.update_or_create(ip=dstofpacket, uuid=individualpcap)
    return render(request, 'about.html', {"list": list})
You can use the multiprocessing answer, and also improve Scapy's reading speed by using the PcapReader generator rather than rdpcap:
with PcapReader(filename) as fdesc:
    for pkt in fdesc:
        [actions on the pkt]
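As a rough sketch of how the two suggestions could be combined, assuming Scapy is installed: each worker streams one file with PcapReader and returns only the distinct destination addresses, which the parent process can then write to the database. The function names (destination_ips, destinations_in_file, analyse_files) and the default of 4 processes are hypothetical and not part of the original view.

```python
from multiprocessing import Pool


def destination_ips(packets, ip_layer):
    """Return the set of destination addresses seen in an iterable of packets."""
    return {pkt[ip_layer].dst for pkt in packets if pkt.haslayer(ip_layer)}


def destinations_in_file(filename):
    # Imported here so that only the worker processes need Scapy loaded.
    from scapy.all import IP, PcapReader
    # PcapReader yields packets one at a time instead of loading the whole file.
    with PcapReader(filename) as reader:
        return filename, destination_ips(reader, IP)


def analyse_files(filenames, processes=4):
    pool = Pool(processes)
    try:
        # One job per pcap file; returns {filename: set of destination IPs}.
        return dict(pool.map(destinations_in_file, filenames))
    finally:
        pool.close()
        pool.join()
```

Keeping the ORM calls in the parent avoids sharing Django database connections with worker processes, and collecting a set per file means each address is written once rather than once per packet.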
I consider mixing multiprocessing and Django tricky. I once worked on such a solution and finally decided to use Celery and RabbitMQ instead.
With Celery you can easily define a task that processes a single pcap. Then you can start a few independent workers to process files in the background. This results in a slightly more complicated architecture (you need to provide a message queue, e.g. RabbitMQ, and the Celery workers), but you gain much simpler code.
In my case Celery saved a lot of time.
You can also check this question and answers:
How to use python multiprocessing module in django view

Parallelizing a nested Python for loop

What type of parallel Python approach would be suited to efficiently spreading the CPU-bound workload shown below? Is it feasible to parallelize the section? It looks like there is not much tight coupling between the loop iterations, i.e. portions of the loop could be handled in parallel so long as appropriate communication to reconstruct the store variable is done at the end. I'm currently using Python 2.7, but if a strong case could be made that this problem can be easily handled in a newer version, then I will consider migrating the code base.
I have tried to capture the spirit of the computation with the example below. I believe that it has the same connectedness between the loops/variables as my actual code.
import numpy as np
nx = 20
ny = 30
myList1 = [0]*100
myList2 = [1]*25
value1 = np.zeros(nx)
value2 = np.zeros(ny)
store = np.zeros((nx, ny, len(myList1), len(myList2)))
for i in range(nx):
    for j in range(ny):
        f = calc(value1[i], value2[j])  # returns a list
        for k, data1 in enumerate(myList1):
            for p, data2 in enumerate(myList2):
                meanval = np.sum(f[:]/data1)*data2
                store[i,j,k,p] = meanval
Here are two approaches you can take. What's wise also depends on where the bottleneck is, which is something that can best be measured rather than guessed.
The ideal option would be to leave all low level optimization to Numpy. Right now you have a mix of native Python code and Numpy code. The latter doesn't play well with loops. They work, of course, but by having loops in Python, you force operations to take place sequentially in the order you specified. It's better to give Numpy operations that it can perform on as many elements at once as possible, i.e. matrix transformations. That benefits performance, not only because of automatic (partial) parallelization; even single threads will be able to get more out of the CPU. A highly recommended read to learn more about this is From Python to Numpy.
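As an illustration of what handing the work to Numpy could look like for the loop in the question, here is a sketch under stated assumptions. The calc function below is a hypothetical stand-in (the real one is not shown), and the example assumes the divisors are non-zero (the question's myList1 of zeros would divide by zero). It relies on the identity sum(f / d1) * d2 == sum(f) * d2 / d1, so the two inner loops collapse into one broadcast expression.

```python
import numpy as np


def calc(v1, v2):
    # Hypothetical stand-in for the question's calc(); returns a 1-D array.
    return np.array([v1 + v2, v1 * v2, v1 - v2])


def store_with_loops(value1, value2, list1, list2):
    # Reference version: the four nested loops from the question.
    store = np.zeros((len(value1), len(value2), len(list1), len(list2)))
    for i, v1 in enumerate(value1):
        for j, v2 in enumerate(value2):
            f = calc(v1, v2)
            for k, d1 in enumerate(list1):
                for p, d2 in enumerate(list2):
                    store[i, j, k, p] = np.sum(f / d1) * d2
    return store


def store_vectorized(value1, value2, list1, list2):
    # Sum f once per (i, j) pair; only this part still loops in Python.
    totals = np.array([[np.sum(calc(v1, v2)) for v2 in value2] for v1 in value1])
    d1 = np.asarray(list1, dtype=float)
    d2 = np.asarray(list2, dtype=float)
    # Broadcasting expands (nx, ny, 1, 1) * (1, 1, 1, np) / (1, 1, nk, 1).
    return totals[:, :, None, None] * d2[None, None, None, :] / d1[None, None, :, None]


if __name__ == "__main__":
    v1 = np.array([1.0, 2.0, 3.0])
    v2 = np.array([0.5, 1.5])
    fast = store_vectorized(v1, v2, [1.0, 2.0, 4.0, 8.0], [1.0, 2.0, 3.0, 4.0, 5.0])
    print(fast.shape)  # prints (3, 2, 4, 5)
```

Whether this helps in practice depends on how much of the time is spent inside the real calc, which is why measuring first is worthwhile.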
If you do need to parallelize pure Python code, you have few options but to go with multiple processes. For that, refer to the multiprocessing module. Rearrange the code into three steps:
1. Preparing the inputs for every job
2. Dividing those jobs between a pool of workers to be run in parallel (fork/map)
3. Collecting the results (join/reduce)
You need to strike a balance between enough processes to make parallelizing worthwhile, and not so many that they will be too short-lived. The cost of spinning up processes and communicating with them would then become significant by itself.
A simple solution would be to generate a list of (i,j) pairs, so that there will be nx*ny jobs. Then make a function that takes such a pair as input and returns a list of (i,j,k,p,meanval) tuples. Try to only use the inputs to the function and return a result. Everything local; no side effects et cetera. Read-only access to globals such as myList1 is okay, but modification requires special measures as described in the documentation. Pass the function and the list of inputs to a worker pool. Once it has finished producing partial results, combine all those into your store.
Here's an example:
from multiprocessing import Pool
import numpy as np
# Global variables are OK, as long as their contents are not modified, although
# these might just as well be moved into the worker function or an initializer
nx = 20
ny = 30
myList1 = [0]*100
myList2 = [1]*25
value1 = np.zeros(nx)
value2 = np.zeros(ny)
def calc_meanvals_for(pair):
    """Process a reasonably sized chunk of the problem"""
    i, j = pair
    f = calc(value1[i], value2[j])
    results = []
    for k, data1 in enumerate(myList1):
        for p, data2 in enumerate(myList2):
            meanval = np.sum(f[:]/data1)*data2
            results.append((i, j, k, p, meanval))
    return results
# This module will be imported by every worker - that's how they will be able
# to find the global variables and the calc function - so make sure to check
# if this is the main program, because without that, every worker will start more
# workers, each of which will start even more, and so on, in an endless loop
if __name__ == '__main__':
    # Create a pool of worker processes, each able to use a CPU core
    pool = Pool()
    # Prepare the arguments, one per function invocation (tuples to fake multiple)
    arg_pairs = [(i,j) for i in range(nx) for j in range(ny)]
    # Now comes the parallel step: given a function and a list of arguments,
    # have a worker invoke that function with one argument until all arguments
    # have been used, collecting the return values in a list
    return_values = pool.map(calc_meanvals_for, arg_pairs)
    # Since the function also returns a list, there's now a list of lists - consider
    # itertools.chain.from_iterable to flatten them - to be processed further
    store = np.zeros((nx, ny, len(myList1), len(myList2)))
    for results in return_values:
        for i, j, k, p, meanval in results:
            store[i,j,k,p] = meanval

Map-style parallel processing?

I'm trying to do some very simple parallel processing in Python. I have a list of data, and I want to compute the exact same thing for each element and return the results as a list, so I looked into some of the simple map-style modules available.
I previously used the pprocess module, but it does not seem to work this time. I looked into using either forkmap or forkfun, but I haven't really found any good examples of how to use them.
What would you recommend as the easiest to use map-style parallel processing module? Preferably with a tutorial of some sort.
First off, I'm not sure having multiple threads would make your program go much faster (but I would be curious to see what the speedup is).
I would not use a special tutorial/module and would just use the basic process/threading tools:
from multiprocessing import Process, Queue
def MyMapFunction(tab, results_queue):
    # Output values out of map into the queue (Queue())
    results_queue.put([(x, 2 * x) for x in tab])
if __name__ == '__main__':
    # tab is your input list, assumed to be defined already
    processes = []
    results_queue = Queue()
    for i in range(50):
        p = Process(target=MyMapFunction, args=(tab[i*50:(i+1)*50], results_queue))
        p.start()
        processes.append(p)
    # Waiting and Reducing...
    all_key_values = {}
    for _ in range(50):
        for k, v in results_queue.get():
            all_key_values.setdefault(k, []).append(v)
    # Some sort of check that the processes are done, but they should be
    for p in processes:
        p.join()
I'll let you do the reduce the same way I did the map, and fix the sloppy i*50:(i+1)*50 slicing that I wrote quickly to give an example.
This is the pattern I use when I want to run things in parallel in Python.
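For comparison, the same map-then-reduce pattern can be written with multiprocessing.Pool, which handles starting the processes and collecting the results. This is only a sketch: the chunk size of 5, the 4 worker processes, and the helper names chunked and reduce_pairs are arbitrary choices for the example.

```python
from multiprocessing import Pool


def my_map_function(chunk):
    # Same toy mapping as above: pair each value with its double.
    return [(x, 2 * x) for x in chunk]


def chunked(seq, size):
    # Split seq into consecutive slices of at most `size` items.
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def reduce_pairs(list_of_pair_lists):
    # Group values by key across all the partial results.
    grouped = {}
    for pairs in list_of_pair_lists:
        for key, value in pairs:
            grouped.setdefault(key, []).append(value)
    return grouped


if __name__ == "__main__":
    data = list(range(20))
    pool = Pool(processes=4)
    try:
        # One job per chunk; results come back in chunk order.
        mapped = pool.map(my_map_function, chunked(data, 5))
    finally:
        pool.close()
        pool.join()
    print(reduce_pairs(mapped)[3])  # prints [6]
```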