I'm trying to do some very simple parallel processing in Python. I have a list of data, and I want to compute the exact same thing for each element and return the results as a list, so I looked into some of the simple map-style modules available (https://wiki.python.org/moin/ParallelProcessing).
I previously used the pprocess module, but it does not seem to work this time. I looked into using either forkmap or forkfun, but I haven't found any good examples of how to use them.
What would you recommend as the easiest to use map-style parallel processing module? Preferably with a tutorial of some sort.
First off, I'm not sure that having multiple threads would make your program go much faster (but I would be curious to see what the speed-up is).
I would not use a special tutorial/module; I would just use the basic process/threading tools.
from multiprocessing import Process, Queue

def MyMapFunction(tab, results_queue):
    # Output values out of map into the queue (Queue());
    # the return value of a Process target is discarded
    results_queue.put([(x, 2 * x) for x in tab])

if __name__ == '__main__':
    tab = list(range(2500))  # example input data
    processes = []
    results_queue = Queue()
    for i in range(50):
        # args must be a tuple: one chunk of the data plus the shared queue
        p = Process(target=MyMapFunction, args=(tab[i*50:(i+1)*50], results_queue))
        processes.append(p)
        p.start()
    # Waiting and Reducing...
    all_key_values = {}
    for _ in range(50):
        for k, v in results_queue.get():
            all_key_values.setdefault(k, []).append(v)
    # Some sort of check that the processes are done, but they should be
    for p in processes:
        p.join()
I'll let you do the reduce the same way I did the map, and fix the hard-coded i*50 : (i+1) * 50 slicing that I wrote quickly to give an example.
This is the pattern I use when I want to run work in parallel in Python.
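Since the question asks for a map-style interface, the same split/map/merge pattern can also be sketched with multiprocessing.Pool from the standard library. This is only a minimal sketch: the helper names (double_pairs, chunked, merge) and the chunk size of 50 are made up for illustration and are not part of the pattern above.

```python
from multiprocessing import Pool

def double_pairs(chunk):
    # The "map" step: runs in a worker process on one chunk of the data
    return [(x, 2 * x) for x in chunk]

def chunked(data, size):
    # Split the input into consecutive slices of at most `size` elements
    return [data[i:i + size] for i in range(0, len(data), size)]

def merge(chunk_results):
    # The "reduce" step: collect every (key, value) pair into one dict
    merged = {}
    for pairs in chunk_results:
        for k, v in pairs:
            merged.setdefault(k, []).append(v)
    return merged

if __name__ == "__main__":
    data = list(range(2500))
    pool = Pool()  # defaults to one worker per CPU core
    try:
        # pool.map blocks until every chunk is done and keeps input order
        results = pool.map(double_pairs, chunked(data, 50))
    finally:
        pool.close()
        pool.join()
    print(len(merge(results)))
```

The pool takes care of starting the workers, passing the chunks to them and collecting the return values, so no explicit Queue is needed.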
I am trying to run a loop 100,000 times. I have used the map function, as shown below, to divide the work between processors and reduce the running time.
However, I have to pass the variable as an argument to the mapped function, and this takes more time than defining the variable inside the main function. The problem with defining the variable inside the main function is that it is generated by a random function, so every time a different processor picks up the function it produces a new random Gaussian sample, which is not what I want.
As a solution, I defined the Gaussian random variable outside the main function and passed it in as an argument. But now map takes more time to run. Can anyone please help me reduce the map processing time, or suggest where to define the random Gaussian variable so that it is calculated once and shared by the different processors?
Defining the random Gaussian variable to pass as an argument to the map function
Code
def E_and_P(Velocity_s, Position_s,tb):
~
~
for index in range(0,4000):
return X_position,Y_position,Z_position, VX_Vel, VY_Vel
if __name__ == "__main__":
    Velocity_mu = 0
    Velocity_sigma = 1*1e8 # mean and standard deviation
    Velocity_s = np.random.normal(Velocity_mu, Velocity_sigma, 100000)
    print("Velocity_s =", Velocity_s)
    #print("Velocity_s=", Velocity_s)
    Position_mu = 0
    Position_sigma = 1*1e-9 # mean and standard deviation
    Position_s = np.random.normal(Position_mu, Position_sigma, 100000)
    #print("Position_s =", Position_s)
    tb = range(100000)
    #print("tb=",tb)
    items = [(Velocity_s, Position_s,tb) for tb in range(100000)]
    p = Pool(processes=4)
    result = p.starmap(E_and_P, items)
    p.close()
    p.join()
Please help or suggest some new ways.
Based on your last comment, you could change this line:
items = [(Velocity_s, Position_s,tb) for tb in range(100000)]
to:
items = [(Velocity_s[tb], Position_s[tb],tb) for tb in range(100000)]
Each element of items is now a simple 3-number tuple: one velocity, one position, and one index.
You will also have to change E_and_P, since its arguments are now 3 scalars: one velocity, one position, and one index.
def E_and_P(vel, pos, tb):
~
~
This should dramatically improve the performance. When using multiprocessing, keep in mind that different processes do not share an address space. All the data exchanged between processes must be serialized (pickled) on one end and rebuilt as Python objects on the other end. As I indicated in my comment, your original implementation serialized about 20 billion numbers: 100000 tasks, each carrying two 100000-element arrays. This approach still has 100000 steps, but each step needs to serialize only 3 numbers. So 300000 numbers instead of 20000000000.
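To get a feel for the difference, the size of a single task can be measured by pickling it by hand, which is roughly what multiprocessing does before sending it to a worker. This is a rough sketch only: payload_bytes is a hypothetical helper, and the exact byte counts depend on the NumPy and pickle versions in use.

```python
import pickle
import numpy as np

def payload_bytes(item):
    # Size of one task after pickling, i.e. roughly what is sent to a worker
    return len(pickle.dumps(item))

n = 100000
velocity_s = np.random.normal(0, 1e8, n)
position_s = np.random.normal(0, 1e-9, n)

# One task in the original version: both full arrays plus the index
whole_arrays = (velocity_s, position_s, 0)
# One task in the revised version: two scalars plus the index
scalars = (velocity_s[0], position_s[0], 0)

big = payload_bytes(whole_arrays)
small = payload_bytes(scalars)
print("full arrays per task:", big, "bytes")
print("scalars per task:", small, "bytes")
```

Multiplying each figure by the 100000 tasks gives an estimate of the total traffic between the main process and the workers.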
Given an Observable<Input> and a mapping function Function<Input, Output> that is expensive but takes variable time, is there a way to call the mapping function in parallel on multiple inputs, and receive the outputs in the order they're produced?
I've tried using observeOn() with a multi-threaded Scheduler:
PublishSubject<Input> inputs = PublishSubject.create();
Function<Input, Output> mf = ...
Observer<Output> myObserver = ...
// Note: same results with newFixedThreadPool(2)
Executor exec = Executors.newWorkStealingPool();
// Use ConnectableObservable to make sure mf is called only once
// no matter how many downstream observers
ConnectableObservable<Output> outputs = inputs
    .observeOn(Schedulers.from(exec))
    .map(mf)
    .publish();
outputs.subscribe(myObserver1);
outputs.subscribe(myObserver2);
outputs.connect();
inputs.onNext(slowInput); // `mf.apply()` takes a long time to complete on this input
inputs.onNext(fastInput); // `mf.apply()` takes a short time to complete on this input
but in testing, mf.apply(fastInput) is never called till after mf.apply(slowInput) completes.
If I play some tricks in my test with CountDownLatch to ensure mf.apply(slowInput) can't complete until after mf.apply(fastInput), the program deadlocks.
Is there some simple operator I should be using here, or is getting Observables out of order just against the grain of RxJava, and I should be using a different technology?
ETA: I looked at using ParallelFlowable (converting it back to a plain Flowable with .sequential() before subscribing myObserver1/2, or rather mySubscriber1/2), but then I get extra mf.apply() calls, one per input per Subscriber. There's ConnectableFlowable, but I'm not having much luck figuring out how to mix it with .parallel().
I guess the observeOn operator does not support concurrent execution on its own. So how about using flatMap? Assume the mf function takes a long time.
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
or
ConnectableObservable<Output> outputs = inputs
    .flatMap(it -> Observable.just(it)
        .map(mf)
        .subscribeOn(Schedulers.from(exec)))
    .publish();
Edit 2019-12-30
If you want to run the tasks concurrently but still need the outputs in input order, use the concatMapEager operator instead of flatMap.
ConnectableObservable<Output> outputs = inputs
    .concatMapEager(it -> Observable.just(it) // here
        .observeOn(Schedulers.from(exec))
        .map(mf))
    .publish();
Doesn't sound possible to me, unless Rx has some very specialised operator to do so. If you're using flatMap to do the mapping, then the elements will arrive out-of-order. Or you could use concatMap but then you'll lose the parallel mapping that you want.
Edit: As mentioned by another poster, concatMapEager should work for this. Parallel subscription and in-order results.
What type of parallel Python approach would be suited to efficiently spreading the CPU-bound workload shown below? Is it feasible to parallelize this section? It looks like there is not much tight coupling between the loop iterations, i.e. portions of the loop could be handled in parallel as long as the store variable is reconstructed appropriately at the end. I'm currently using Python 2.7, but if a strong case could be made that this problem can be easily handled in a newer version, then I will consider migrating the code base.
I have tried to capture the spirit of the computation with the example below. I believe that it has the same connectedness between the loops/variables as my actual code.
import numpy as np

nx = 20
ny = 30
myList1 = [0]*100
myList2 = [1]*25
value1 = np.zeros(nx)
value2 = np.zeros(ny)
# np.zeros takes the shape as a single tuple
store = np.zeros((nx, ny, len(myList1), len(myList2)))
for i in range(nx):
    for j in range(ny):
        f = calc(value1[i],value2[j]) #returns a list
        for k,data1 in enumerate(myList1):
            for p,data2 in enumerate(myList2):
                meanval = np.sum(f[:]/data1)*data2
                store[i,j,k,p] = meanval
Here are two approaches you can take. What's wise also depends on where the bottleneck is, which is something that can best be measured rather than guessed.
The ideal option would be to leave all low level optimization to Numpy. Right now you have a mix of native Python code and Numpy code. The latter doesn't play well with loops. They work, of course, but by having loops in Python, you force operations to take place sequentially in the order you specified. It's better to give Numpy operations that it can perform on as many elements at once as possible, i.e. matrix transformations. That benefits performance, not only because of automatic (partial) parallelization; even single threads will be able to get more out of the CPU. A highly recommended read to learn more about this is From Python to Numpy.
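As a rough illustration of the first approach, the two inner loops of the question can be replaced by broadcasting, because np.sum(f/data1)*data2 equals np.sum(f)*data2/data1. This is a sketch under assumptions: calc below is a hypothetical stand-in for the question's function, and the entries of the first list are assumed to be non-zero (the question's placeholder list of zeros would divide by zero).

```python
import numpy as np

def calc(a, b):
    # Hypothetical stand-in for the question's calc(); returns a short array
    return np.array([a, b, a + b], dtype=float)

def store_with_loops(value1, value2, list1, list2):
    # Same structure as the nested loops in the question
    store = np.zeros((len(value1), len(value2), len(list1), len(list2)))
    for i, a in enumerate(value1):
        for j, b in enumerate(value2):
            f = calc(a, b)
            for k, data1 in enumerate(list1):
                for p, data2 in enumerate(list2):
                    store[i, j, k, p] = np.sum(f / data1) * data2
    return store

def store_vectorized(value1, value2, list1, list2):
    # Only sum(f) depends on (i, j); the (k, p) part is the ratio data2 / data1
    f_sums = np.array([[np.sum(calc(a, b)) for b in value2] for a in value1])
    d1 = np.asarray(list1, dtype=float)
    d2 = np.asarray(list2, dtype=float)
    ratio = d2[None, :] / d1[:, None]          # shape (len(list1), len(list2))
    return f_sums[:, :, None, None] * ratio[None, None, :, :]

value1 = np.linspace(0.0, 1.0, 4)
value2 = np.linspace(1.0, 2.0, 5)
list1 = [1.0, 2.0, 4.0]
list2 = [3.0, 5.0]

loops = store_with_loops(value1, value2, list1, list2)
fast = store_vectorized(value1, value2, list1, list2)
print(np.allclose(loops, fast))
```

Whether calc itself can be vectorized over i and j depends on what it does, which is why the sketch still calls it once per pair.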
If you do need to parallelize pure Python code, you have few options but to go with multiple processes. For that, refer to the multiprocessing module. Rearrange the code into three steps:
Preparing the inputs for every job
Dividing those jobs between a pool of workers to be run in parallel (fork/map)
Collecting the results (join/reduce)
You need to strike a balance between enough processes to make parallelizing worthwhile, and not so many that they will be too short-lived. The cost of spinning up processes and communicating with them would then become significant by itself.
A simple solution would be to generate a list of (i,j) pairs, so that there will be nx*ny jobs. Then make a function that takes such a pair as input and returns a list of (i,j,k,p,meanval) tuples. Try to use only the inputs to the function and return a result: everything local, no side effects, et cetera. Read-only access to globals such as myList1 is okay, but modification requires special measures as described in the documentation. Pass the function and the list of inputs to a worker pool. Once it has finished producing partial results, combine them all into your store.
Here's an example:
from multiprocessing import Pool
import numpy as np

# Global variables are OK, as long as their contents are not modified, although
# these might just as well be moved into the worker function or an initializer
nx = 20
ny = 30
myList1 = [0]*100
myList2 = [1]*25
value1 = np.zeros(nx)
value2 = np.zeros(ny)

def calc_meanvals_for(pair):
    """Process a reasonably sized chunk of the problem"""
    i, j = pair
    f = calc(value1[i], value2[j])  # calc is the function from the question
    results = []
    for k, data1 in enumerate(myList1):
        for p, data2 in enumerate(myList2):
            meanval = np.sum(f[:]/data1)*data2
            results.append((i,j,k,p,meanval))
    return results

# This module will be imported by every worker - that's how they will be able
# to find the global variables and the calc function - so make sure to check
# if this is the main program, because without that, every worker will start more
# workers, each of which will start even more, and so on, in an endless loop
if __name__ == '__main__':
    # Create a pool of worker processes, each able to use a CPU core
    pool = Pool()
    # Prepare the arguments, one per function invocation (tuples to fake multiple)
    arg_pairs = [(i,j) for i in range(nx) for j in range(ny)]
    # Now comes the parallel step: given a function and a list of arguments,
    # have a worker invoke that function with one argument until all arguments
    # have been used, collecting the return values in a list
    return_values = pool.map(calc_meanvals_for, arg_pairs)
    # Since the function also returns a list, there's now a list of lists - consider
    # itertools.chain.from_iterable to flatten them - to be processed further
    # (np.zeros takes the shape as a single tuple)
    store = np.zeros((nx, ny, len(myList1), len(myList2)))
    for results in return_values:
        for i, j, k, p, meanval in results:
            store[i,j,k,p] = meanval
I have a genetic algorithm which I would like to speed up. I'm thinking the easiest way to achieve this is with Python's multiprocessing module. After running cProfile on my GA, I found that most of the computation time is spent in the evaluation function.
def evaluation():
    scores = []
    for chromosome in population:
        scores.append(costly_function(chromosome))
How would I go about parallelizing this method? It is important that all the scores are appended in the same order as they would be if the program ran sequentially.
I'm using Python 2.7.
Use a Pool. I show both imap and map because some results on Google say that map may not preserve ordering, though I have yet to see proof; both map and imap return results in input order (only imap_unordered does not):
from multiprocessing import Pool

def evaluation(population):
    return list(Pool(processes=nprocs).imap(costly_function,population))
or (what I use):
    return Pool(processes=nprocs).map(costly_function,population)
Set nprocs to the number of parallel processes you want.
From:
https://docs.python.org/dev/library/multiprocessing.html#multiprocessing.pool.Pool
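For completeness, here is a small self-contained sketch that checks the ordering and also closes the pool when it is done, instead of creating a new one on every call. The costly_function body, the population and the pool size of 4 are placeholders invented for the example; close() and join() are used rather than a with block so that the sketch also runs on Python 2.7.

```python
from multiprocessing import Pool

def costly_function(chromosome):
    # Hypothetical stand-in for the real fitness function
    return sum(gene * gene for gene in chromosome)

def evaluation(population, pool):
    # pool.map returns the results in the same order as the inputs
    return pool.map(costly_function, population)

if __name__ == "__main__":
    population = [[i, i + 1, i + 2] for i in range(10)]
    pool = Pool(processes=4)
    try:
        scores = evaluation(population, pool)
    finally:
        pool.close()
        pool.join()
    # Compare against the sequential result to confirm the ordering
    print(scores == [costly_function(c) for c in population])
```

Passing the pool in as an argument lets a genetic algorithm reuse the same worker processes across generations.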
I'm writing an object which draws from a multiprocessing queue, and I found that when I run this code, I get data = []. Whereas if I tell the program to sleep for a little bit at the place denoted, I get data = [1,2], as it should.
from multiprocessing import Queue
from queue import Empty  # Python 2: from Queue import Empty
import time

q = Queue()
taken = 0
data = []
for i in [1,2]:
    q.put(i)
# time.sleep(1) making this call causes the correct output
while not q.empty() and taken < 2:
    try:
        data.append(q.get(timeout=1))
        taken+=1
    except Empty:
        continue
**EDIT:** This also happens if there's a print statement before the while loop. This suggests to me that something is going on with the calls to q.put(), but I can't find any documentation on this issue.
It's mentioned in the docs for multiprocessing.Queue:
Note: When an object is put on a queue, the object is pickled and a background thread later flushes the pickled data to an underlying pipe. This has some consequences which are a little surprising, but should not cause any practical difficulties – if they really bother you then you can instead use a queue created with a manager.
After putting an object on an empty queue there may be an infinitesimal delay before the queue’s empty() method returns False and get_nowait() can return without raising Queue.Empty.
...
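One way to avoid depending on empty() at all is to rely on the blocking get() with a timeout and stop once the expected number of items has arrived. This is a minimal sketch, assuming the number of items is known in advance; drain is a hypothetical helper name, and the import shown is the Python 3 spelling (on Python 2 it would be from Queue import Empty).

```python
from multiprocessing import Queue
from queue import Empty  # Python 2: from Queue import Empty

def drain(q, expected, timeout=1.0):
    # Block on get() instead of polling empty(), which can briefly report
    # True while the background feeder thread is still flushing the data
    items = []
    while len(items) < expected:
        try:
            items.append(q.get(timeout=timeout))
        except Empty:
            break  # nothing arrived within the timeout; give up
    return items

q = Queue()
for i in [1, 2]:
    q.put(i)

data = drain(q, expected=2)
print(data)
```

Because get() waits for up to the timeout, the short delay between put() and the data becoming readable no longer matters.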