Parallelizing a nested Python for loop

Parallelizing a nested Python for loop - python-2.7

What type of parallel Python approach would be suited to efficiently spreading the CPU bound workload shown below. Is it feasible to parallelize the section? It looks like there is not much tight coupling between the loop iterations i.e. portions of the loop could be handled in parallel so long as an appropriate communication to reconstruct the store variable is done at the end. I'm currently using Python2.7, but if a strong case could be made that this problem can be easily handled in a newer version, then I will consider migrating the code base.
I have tried to capture the spirit of the computation with the example below. I believe that it has the same connectedness between the loops/variables as my actual code.
nx = 20
ny = 30
myList1 = [0]*100
myList2 = [1]*25
value1 = np.zeros(nx)
value2 = np.zeros(ny)
store = np.zeros(nx,ny,len(myList1),len(myList2))
for i in range(nx):
for j in range(ny):
f = calc(value1[i],value2[j]) #returns a list
for k,data1 in enumerate(myList1):
for p,data2 in enumerate(myList2):
meanval = np.sum(f[:]/data1)*data2
store[i,j,k,p] = meanval

Here are two approaches you can take. What's wise also depends on where the bottleneck is, which is something that can best be measured rather than guessed.
The ideal option would be to leave all low level optimization to Numpy. Right now you have a mix of native Python code and Numpy code. The latter doesn't play well with loops. They work, of course, but by having loops in Python, you force operations to take place sequentially in the order you specified. It's better to give Numpy operations that it can perform on as many elements at once as possible, i.e. matrix transformations. That benefits performance, not only because of automatic (partial) parallelization; even single threads will be able to get more out of the CPU. A highly recommended read to learn more about this is From Python to Numpy.
If you do need to parallelize pure Python code, you have few options but to go with multiple processes. For that, refer to the multiprocessing module. Rearrange the code into three steps:
Preparing the inputs for every job
Dividing those jobs between a pool of workers to be run in parallel (fork/map)
Collecting the results (join/reduce)
You need to strike a balance between enough processes to make parallelizing worthwhile, and not so many that they will be too short-lived. The cost of spinning up processes and communicating with them would then become significant by itself.
A simple solution would be to generate a list of (i,j) pairs, so that there will nx*ny jobs. Then make a function that takes such pair as input and returns a list of (i,j,k,p,meanval). Try to only use the inputs to the function and return a result. Everything local; no side effects et cetera. Read-only access to globals such as myList1 is okay, but modification requires special measures as described in the documentation. Pass the function and the list of inputs to a worker pool. Once it has finished producing partial results, combine all those into your store.
Here's an example:
from multiprocessing import Pool
import numpy as np
# Global variables are OK, as long as their contents are not modified, although
# these might just as well be moved into the worker function or an initializer
nx = 20
ny = 30
myList1 = [0]*100
myList2 = [1]*25
value1 = np.zeros(nx)
value2 = np.zeros(ny)
def calc_meanvals_for(pair):
"""Process a reasonably sized chunk of the problem"""
i, j = pair
f = calc(value1[i], value2[j])
results = []
for k, data1 in enumerate(myList1):
for p, data2 in enumerate(myList2):
meanval = np.sum(f[:]/data1)*data2
results.append((i,j,k,p,meanval))
return results
# This module will be imported by every worker - that's how they will be able
# to find the global variables and the calc function - so make sure to check
# if this the main program, because without that, every worker will start more
# workers, each of which will start even more, and so on, in an endless loop
if __name__ == '__main__':
# Create a pool of worker processes, each able to use a CPU core
pool = Pool()
# Prepare the arguments, one per function invocation (tuples to fake multiple)
arg_pairs = [(i,j) for i in range(nx) for j in range(ny)]
# Now comes the parallel step: given a function and a list of arguments,
# have a worker invoke that function with one argument until all arguments
# have been used, collecting the return values in a list
return_values = pool.map(calc_meanvals_for, arg_pairs)
# Since the function also returns a list, there's now a list of lists - consider
# itertools.chain.from_iterable to flatten them - to be processed further
store = np.zeros(nx, ny, len(myList1), len(myList2))
for results in return_values:
for i, j, k, p, meanval in results:
store[i,j,k,p] = meanval

Related

Using "map function" in python to reduce time of processing

I am trying to run a loop for 100,000 times. I have used map function as shown below to divide the work between processors and make it less time consuming.
But also I have to pass the variable as argument to the map function due to which it consumes more time as compared to when I define this variable inside the main function. But the problem with define the variable inside the main function is - this variable is generated by random function hence when different processors come to pick function every time it give new random gussian plot and this is not required.
Hence- as a solution I defined the gussian random function out of the main function and passed as an argument to main function. But now the map is consuming more time to process. Can any one please help to reduce the time of map processing or suggest me where to define the random gussian variable so that it is calculated once and picked by different processors.
Defining random gussian variable to pass as an argument to map function
Code
def E_and_P(Velocity_s, Position_s,tb):
~
~
for index in range(0,4000):
return X_position,Y_position,Z_position, VX_Vel, VY_Vel
if __name__ == "__main__":
Velocity_mu = 0
Velocity_sigma = 1*1e8 # mean and standard deviation
Velocity_s = np.random.normal(Velocity_mu, Velocity_sigma, 100000)
print("Velocity_s =", Velocity_s)
#print("Velocity_s=", Velocity_s)
Position_mu = 0
Position_sigma = 1*1e-9 # mean and standard deviation
Position_s = np.random.normal(Position_mu, Position_sigma, 100000)
#print("Position_s =", Position_s)
tb = range(100000)
#print("tb=",tb)
items = [(Velocity_s, Position_s,tb) for tb in range(100000)]
p = Pool(processes=4)
result = p.starmap(E_and_P, items)
p.close()
p.join()
Please help or suggest some new ways.

Based on your last comment, you could change this line:
items = [(Velocity_s, Position_s,tb) for tb in range(100000)]
to:
items = [(Velocity_s[tb], Position_s[tb],tb) for tb in range(100000)]
Each element of items is now a simple 3-number tuple: one velocity, one position, and one index.
You will also have to change E_and_P since its arguments are now 3 scalers: one velocity, one position, and one index.
def E_and_P(vel, pos, tb):
~
~
This should dramatically improve the performance. When using multiprocessing, keep in mind that different processes do not share an address space. All the data that gets exchanged between processes must be digitized on one end and rebuilt as Python objects on the other end. As I indicated in my comment, your original implementation resulted in about 20 billion digitizations. This approach still has 100000 steps, but each step needs to digitize only 3 numbers. So 300000 digitizations instead of 2000000000.

Parallelize map() operation on single Observable and receive results out of order

Given an Observable<Input> and a mapping function Function<Input, Output> that is expensive but takes variable time, is there a way to call the mapping function in parallel on multiple inputs, and receive the outputs in the order they're produced?
I've tried using observeOn() with a multi-threaded Scheduler:
PublishSubject<Input> inputs = PublishSubject.create();
Function<Input, Output> mf = ...
Observer<Output> myObserver = ...
// Note: same results with newFixedThreadPool(2)
Executor exec = Executors.newWorkStealingThreadPool();
// Use ConnectableObservable to make sure mf is called only once
// no matter how many downstream observers
ConnectableObservable<Output> outputs = inputs
.observeOn(SchedulersFrom(exec))
.map(mf)
.publish();
outputs.subscribe(myObserver1);
outputs.subscribe(myObserver2);
outputs.connect();
inputs.onNext(slowInput); // `mf.apply()` takes a long time to complete on this input
inputs.onNext(fastInput); // `mf.apply()` takes a short time to complete on this input
but in testing, mf.apply(fastInput) is never called till after mf.apply(slowInput) completes.
If I play some tricks in my test with CountDownLatch to ensure mf.apply(slowInput) can't complete until after mf.apply(fastInput), the program deadlocks.
Is there some simple operator I should be using here, or is getting Observables out of order just against the grain of RxJava, and I should be using a different technology?
ETA: I looked at using ParallelFlowable (converting it back to a plain Flowable with .sequential() before subscribing myObserver1/2, or rather mySubscriber1/2), but then I get extra mf.apply() calls, one per input per Subscriber. There's ConnectableFlowable, but I'm not having much luck figuring out how to mix it with .parallel().

I guess observeOn operator does not support concurrent execution for alone. So, how about using flatMap? Assume the mf function needs a lot time.
ConnectableObservable<Output> outputs = inputs
.flatMap(it -> Observable.just(it)
.observeOn(SchedulersFrom(exec))
.map(mf))
.publish();
or
ConnectableObservable<Output> outputs = inputs
.flatMap(it -> Observable.just(it)
.map(mf))
.subscribeOn(SchedulersFrom(exec))
.publish();
Edit 2019-12-30
If you want to run tasks concurrently, but supposed to keep the order, use concatMapEager operator instead of flatMap.
ConnectableObservable<Output> outputs = inputs
.concatMapEager(it -> Observable.just(it) // here
.observeOn(SchedulersFrom(exec))
.map(mf))
.publish();

Doesn't sound possible to me, unless Rx has some very specialised operator to do so. If you're using flatMap to do the mapping, then the elements will arrive out-of-order. Or you could use concatMap but then you'll lose the parallel mapping that you want.
Edit: As mentioned by another poster, concatMapEager should work for this. Parallel subscription and in-order results.

Parallelize for loop in python

I have a genetic algorithm which I would like to speed up. I'm thinking the easiest way to achieve this is by pythons multiprocessing module. After running cProfile on my GA, I found out that most of the computational time takes place in the evaluation function.
def evaluation():
scores = []
for chromosome in population:
scores.append(costly_function(chromosome))
How would I go about to parallelize this method? It is important that all the scores append in the same order as they would if the program would run sequentially.
I'm using python 2.7

Use pool (I show both imap and map because of some results on google say map may not be OK for ordering though I have yet to see proof):
from multiprocessing import Pool
def evaluation(population):
return list(Pool(processes=nprocs).imap(costly_function,population))
or (what I use):
return Pool(processes=nprocs).map(costly_function,population)
Define nprocs to the number of parallel process you want.
From:
https://docs.python.org/dev/library/multiprocessing.html#multiprocessing.pool.Pool

Map-style parallel processing?

I'm trying to do some very simple parallel processing in python. I have a list of data, and I want to compute the exact same thing for each element, and return it as a list, so I looked into some simple map-style modules available (https://wiki.python.org/moin/ParallelProcessing).
I previously used the pprocess module, but it does not seem work this time. I looked into using either forkmap or forkfun, but I haven't really found some nice examples on how to use them.
What would you recommend as the easiest to use map-style parallel processing module? Preferably with a tutorial of some sort.

First off Im not sure having multiple threads would make your program go much faster (but would be curious to see what is the speed up)
I would not use a special tutorial/module and just use basic process/threading stuff
from multiprocessing import Process, Lock, Queue
Output values out of map into the queue (Queue())
processess = []
results_queue = Queue()
for i in xrange(50):
p = Process(target=MyMapFunction, args=tab[i*50:(i+1) * 50])
processess.append(p)
p.start()
# Waiting and Reducing...
all_key_values = {}
for _ in xrange(50):
for k, v in results_queue.get():
all_key_values.setdefault(k, []).append(v)
# Some sort of check that threads are done but they should be
for p in processess:
p.join()
def MyMapFunction(tab):
return [(x, 2 * x) for x in tab]
I let you do the reduce the same way I did the map and correct the shitty i*50 : (i+1) * 50 that I wrote quickly to give an example
This is the pattern I use when I want to multithread in Python

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
Data* current_item = channel_cursors[i];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
Method 1 gave very low latency when then number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
::uint32_t update_id = update_cursor->list_id_;
Data* current_item = channel_cursors[update_id];
if(current_item->state_ == NOT_FILLED_YET) {
std::cerr << "This should never print." << std::endl; // it doesn't
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often given 8ms latency! Using gettimeofday it appears that the change in update_cursor->state_ was very slow to propogate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
std::size_t idx = i;
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ != NOT_FILLED_YET) {
//std::cerr << "Found via update" << std::endl;
i--;
idx = update_cursor->list_id_;
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
Data* current_item = channel_cursors[idx];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
found_an_update = true;
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.

Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.

I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed the Method2 seems better here because the id being smaller than the data in general would mean that you don't have to do round-trips to the main memory too often (which is taxing).
However, what can actually happen is that you have such a line of cache:
Line of cache = [ID1, ID2, ID3, ID4, ...]
^ ^
client server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
Check out the other articles in the serie while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer I think that could help for your update list :)

I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps which means the following code...
while(true)
{
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multi threading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for the update_cursor->state to change. This prevents this thread from stealing all the cpu cycles.

The answer was tricky to figure out, and to be fair would be hard with the information I presented though if anyone actually compiled the source code I provided they'd have a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far as I could scrollback in my terminal. At the very beginning there were a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js