Why does this queue behave like this? - python-2.7

I'm writing an object which draws from a multiprocessing queue, and I found that when I run this code, I get data = []. Whereas if I tell the program to sleep for a little bit at the place denoted, I get data = [1,2], as it should.
from multiprocessing import Queue
import time
q = Queue()
taken = 0
data = []
for i in [1,2]:
q.put(i)
# time.sleep(1) making this call causes the correct output
while not q.empty() and taken < 2:
try:
data.append(q.get(timeout=1))
taken+=1
except Empty:
continue
**EDIT:**This also happens if there's a print statement before the while loop. This suggests to me that there's something happening with the calls to q.put() that's happening, but I can't find any documentation on this issue.

It's mentioned in the docs for multiprocessing.Queue
Note When an object is put on a queue, the object is pickled and a
background thread later flushes the pickled data to an underlying
pipe. This has some consequences which are a little surprising, but
should not cause any practical difficulties – if they really bother
you then you can instead use a queue created with a manager.
After
putting an object on an empty queue there may be an infinitesimal
delay before the queue’s empty() method returns False and get_nowait()
can return without raising Queue.Empty.
...

Related

Parallelize map() operation on single Observable and receive results out of order

Given an Observable<Input> and a mapping function Function<Input, Output> that is expensive but takes variable time, is there a way to call the mapping function in parallel on multiple inputs, and receive the outputs in the order they're produced?
I've tried using observeOn() with a multi-threaded Scheduler:
PublishSubject<Input> inputs = PublishSubject.create();
Function<Input, Output> mf = ...
Observer<Output> myObserver = ...
// Note: same results with newFixedThreadPool(2)
Executor exec = Executors.newWorkStealingThreadPool();
// Use ConnectableObservable to make sure mf is called only once
// no matter how many downstream observers
ConnectableObservable<Output> outputs = inputs
.observeOn(SchedulersFrom(exec))
.map(mf)
.publish();
outputs.subscribe(myObserver1);
outputs.subscribe(myObserver2);
outputs.connect();
inputs.onNext(slowInput); // `mf.apply()` takes a long time to complete on this input
inputs.onNext(fastInput); // `mf.apply()` takes a short time to complete on this input
but in testing, mf.apply(fastInput) is never called till after mf.apply(slowInput) completes.
If I play some tricks in my test with CountDownLatch to ensure mf.apply(slowInput) can't complete until after mf.apply(fastInput), the program deadlocks.
Is there some simple operator I should be using here, or is getting Observables out of order just against the grain of RxJava, and I should be using a different technology?
ETA: I looked at using ParallelFlowable (converting it back to a plain Flowable with .sequential() before subscribing myObserver1/2, or rather mySubscriber1/2), but then I get extra mf.apply() calls, one per input per Subscriber. There's ConnectableFlowable, but I'm not having much luck figuring out how to mix it with .parallel().
I guess observeOn operator does not support concurrent execution for alone. So, how about using flatMap? Assume the mf function needs a lot time.
ConnectableObservable<Output> outputs = inputs
.flatMap(it -> Observable.just(it)
.observeOn(SchedulersFrom(exec))
.map(mf))
.publish();
or
ConnectableObservable<Output> outputs = inputs
.flatMap(it -> Observable.just(it)
.map(mf))
.subscribeOn(SchedulersFrom(exec))
.publish();
Edit 2019-12-30
If you want to run tasks concurrently, but supposed to keep the order, use concatMapEager operator instead of flatMap.
ConnectableObservable<Output> outputs = inputs
.concatMapEager(it -> Observable.just(it) // here
.observeOn(SchedulersFrom(exec))
.map(mf))
.publish();
Doesn't sound possible to me, unless Rx has some very specialised operator to do so. If you're using flatMap to do the mapping, then the elements will arrive out-of-order. Or you could use concatMap but then you'll lose the parallel mapping that you want.
Edit: As mentioned by another poster, concatMapEager should work for this. Parallel subscription and in-order results.

Map-style parallel processing?

I'm trying to do some very simple parallel processing in python. I have a list of data, and I want to compute the exact same thing for each element, and return it as a list, so I looked into some simple map-style modules available (https://wiki.python.org/moin/ParallelProcessing).
I previously used the pprocess module, but it does not seem work this time. I looked into using either forkmap or forkfun, but I haven't really found some nice examples on how to use them.
What would you recommend as the easiest to use map-style parallel processing module? Preferably with a tutorial of some sort.
First off Im not sure having multiple threads would make your program go much faster (but would be curious to see what is the speed up)
I would not use a special tutorial/module and just use basic process/threading stuff
from multiprocessing import Process, Lock, Queue
Output values out of map into the queue (Queue())
processess = []
results_queue = Queue()
for i in xrange(50):
p = Process(target=MyMapFunction, args=tab[i*50:(i+1) * 50])
processess.append(p)
p.start()
# Waiting and Reducing...
all_key_values = {}
for _ in xrange(50):
for k, v in results_queue.get():
all_key_values.setdefault(k, []).append(v)
# Some sort of check that threads are done but they should be
for p in processess:
p.join()
def MyMapFunction(tab):
return [(x, 2 * x) for x in tab]
I let you do the reduce the same way I did the map and correct the shitty i*50 : (i+1) * 50 that I wrote quickly to give an example
This is the pattern I use when I want to multithread in Python

How to catch/monitor/link gevent.sleep() exceptions

I'm using web2py for a project and found that gevent.sleep seems to hang in unexpected disconnects. I'm guessing this is due to improperly handled exception. I can not find it properly written into the documentation, how do I catch, link, or monitor exceptions from gevent.sleep()?
Thank you in advance.
Strange guess, it might be wrong. sleep() suspends current Greenlet and resumes next, pending, Greenlet. Most likely it is next Geenlet that runs after sleep() that blocks execution.
If you don't see traceback printed out it is not coming from sleep().
Source code of sleep function:
def sleep(seconds=0):
"""Put the current greenlet to sleep for at least *seconds*.
*seconds* may be specified as an integer, or a float if fractional seconds
are desired.
If *seconds* is equal to or less than zero, yield control the other coroutines
without actually putting the process to sleep. The :class:`core.idle` watcher
with the highest priority is used to achieve that.
"""
hub = get_hub()
loop = hub.loop
if seconds <= 0:
watcher = loop.idle()
watcher.priority = loop.MAXPRI
else:
watcher = loop.timer(seconds)
hub.wait(watcher)

catching a deadlock in a simple odd-even sending

I'm trying to solve a simple problem with MPI, my implementation is MPICH2 and my code is in fortran. I have used the blocking send and receive, the idea is so simple but when I run it it crashes!!! I have absolutely no idea what is wrong? can anyone make quote on this issue please? there is a piece of the code:
integer, parameter :: IM=100, JM=100
REAL, ALLOCATABLE :: T(:,:), TF(:,:)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,RNK,IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,SIZ,IERR)
prv = rnk-1
nxt = rnk+1
LIM = INT(IM/SIZ)
IF (rnk==0) THEN
ALLOCATE(TF(IM,JM))
prv = MPI_PROC_NULL
ELSEIF(rnk==siz-1) THEN
NXT = MPI_PROC_NULL
LIM = LIM+MOD(IM,SIZ)
END IF
IF (MOD(RNK,2)==0) THEN
CALL MPI_SEND(T(2,:),JM+2,MPI_REAL,PRV,10,MPI_COMM_WORLD,IERR)
CALL MPI_RECV(T(1,:),JM+2,MPI_REAL,PRV,20,MPI_COMM_WORLD,STAT,IERR)
ELSE
CALL MPI_RECV(T(LIM+2,:),JM+2,MPI_REAL,NXT,10,MPI_COMM_WORLD,STAT,IERR)
CALL MPI_SEND(T(LIM+1,:),JM+2,MPI_REAL,NXT,20,MPI_COMM_WORLD,IERR)
END IF
as I understood even processes are not receiving anything while the odd ones finish sending successfully, in some cases when I added some print to observe what is going on I saw that the variable NXT is changing during the sending procedure!!! for example all the odd process was sending message to process 0 not their next one!
The array T is not allocated so reading or writing from it is an error.
I can't see the whole program, but some observation on what I can see:
1) make sure that rnk, size, and prv are integers. Likely, prv is real (by default
typing rules) and you are sending a real to an integer, so tags don't match, hence deadlock.
2) I'd use sendrcv rather than send / recv ; recv / send code section. Two sendrecv's are cleaner ( 2 lines of code vs. 7), guaranteed not to deadlock, and is faster when you have bi-directional links (almost always true.)

Of these 3 methods for reading linked lists from shared memory, why is the 3rd fastest?

I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a linked list's node's state_ as FILLED once its data is filled in and its next pointer has been set to a valid location. Until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure that clients don't see the state_ as FILLED before the data within is actually ready (and it seems to work, I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't lift the client's checking of it out of loops.
Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?
Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':
void method_one()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
Data* current_item = channel_cursors[i];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
Method 1 gave very low latency when then number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...
Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
void method_two()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
::uint32_t update_id = update_cursor->list_id_;
Data* current_item = channel_cursors[update_id];
if(current_item->state_ == NOT_FILLED_YET) {
std::cerr << "This should never print." << std::endl; // it doesn't
continue;
}
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
}
Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often given 8ms latency! Using gettimeofday it appears that the change in update_cursor->state_ was very slow to propogate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...
Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.
void method_three()
{
std::vector<Data*> channel_cursors;
for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
{
Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
channel_cursors.push_back(current_item);
}
UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));
while(true)
{
for(std::size_t i = 0; i < channel_list.size(); ++i)
{
std::size_t idx = i;
ACQUIRE_MEMORY_BARRIER;
if(update_cursor->state_ != NOT_FILLED_YET) {
//std::cerr << "Found via update" << std::endl;
i--;
idx = update_cursor->list_id_;
update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
}
Data* current_item = channel_cursors[idx];
ACQUIRE_MEMORY_BARRIER;
if(current_item->state_ == NOT_FILLED_YET) {
continue;
}
found_an_update = true;
log_latency(current_item->tv_sec_, current_item->tv_usec_);
channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
}
}
}
The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY LATENCY LOG MESSAGE. Which means things are only ever found on the update list! So I don't understand how this method can be faster than method 2.
The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf
Update: All 3 methods are effectively spinlocking until an update occurs. The difference is in how long it takes them to notice the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.
Update2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.
Hypothesis: Method 2 is somehow blocking the update from getting written by the server.
One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).
The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that what you're doing by having three different threads hammering this one memory location as hard as they can with read-access requests is approximately creating a denial-of-service attack on the server preventing it from writing to that cache line for a few milliseconds on occasion.
You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.
I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.
Indeed the Method2 seems better here because the id being smaller than the data in general would mean that you don't have to do round-trips to the main memory too often (which is taxing).
However, what can actually happen is that you have such a line of cache:
Line of cache = [ID1, ID2, ID3, ID4, ...]
^ ^
client server
Which then creates contention.
Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to artificially inflate your ID in the list so that it occupies one line of cache entirely.
Check out the other articles in the serie while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer I think that could help for your update list :)
I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?
Either way, method 2 doesn't have any sleeps which means the following code...
while(true)
{
if(update_cursor->state_ == NOT_FILLED_YET) {
continue;
}
is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multi threading should give you a large number of examples. The main idea here is that this allows the thread to go to sleep while it's waiting for the update_cursor->state to change. This prevents this thread from stealing all the cpu cycles.
The answer was tricky to figure out, and to be fair would be hard with the information I presented though if anyone actually compiled the source code I provided they'd have a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far as I could scrollback in my terminal. At the very beginning there were a slew of updates found without using the update list.
The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:
A
B
A
B
A // and I start looking at the list here
B
But then after I set the update list to there, there are a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists are going to be after that surge of updates, but my starting point in the update list is before that surge, so now I'm going to check for a bunch of updates without finding them. The mixed approach above works best because by iterating over all the elements when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.