RxJava - One producer, many concurrent consumers in a single subscription

I'm trying to get my bearings around some details of RxJava concurrency, and I'm not sure whether what I have in mind is correct. I have a good understanding of how subscribeOn/observeOn work, but I'm trying to nail down some particulars of the pool-backed schedulers. For that, I'm looking at implementing a 1-N producer-consumer chain, with as many consumers as CPUs, as simply as possible.
According to the docs, Schedulers.computation() is backed by a pool of as many threads as there are cores. However, per the Reactive contract, an operator only ever gets sequential calls.
Hence, a setup like this one
Observable.range(1, 1000) // Whatever has to be processed
    .observeOn(Schedulers.computation())
    .doOnNext(/* heavy computation */)
    .doOnCompleted(() -> System.out.println("COMPLETED"))
    .forEach(System.out::println);
despite using a thread pool, will not make concurrent calls to doOnNext; it only ever sees sequential calls on a single worker. Experiments with sleep and inspecting OperatorObserveOn.java seem to confirm this, since a single worker is obtained per observeOn call. Also, if it were otherwise, there would have to be some complicated management of onCompleted having to wait for any pending onNext calls to finish, which I don't find exists.
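For example, a quick experiment along these lines (a sketch, assuming RxJava 1.x) prints the same worker thread name for every item, which points to the single-worker behaviour:
Observable.range(1, 10)
    .observeOn(Schedulers.computation())
    .doOnNext(i -> System.out.println(i + " on " + Thread.currentThread().getName()))
    .toBlocking()
    .forEach(i -> { }); // block the caller so the output can be observed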
Supposing I'm on the right track here (that is, that only one thread is involved at a time, although you can hop across several of them with observeOn), what would then be the proper pattern? I can find examples for the converse situation (synchronizing several async event generators into one consumer), but not a simple example for this typical case.
I guess flatMap is involved, perhaps using the overload (still marked beta in 1.x) that limits the number of concurrent subscriptions. Could it perhaps be as simple as using window/flatMap like this?
Observable
    .range(1, 1000) // Whatever has to be processed
    .window(1) // Emit one observable per item, for example
    .flatMap(/* Processing */, 4) // For 4-concurrent processing
    .subscribe()
In this approach I'm still missing an easy way of maxing out the CPU in an Rx-generic way (that is, by specifying the computation scheduler rather than a maximum number of concurrent subscriptions for flatMap). So, perhaps:
Observable
    .range(1, 1000) // Whatever has to be processed
    .window(1) // Emit one observable per item, for example
    .flatMap(v -> Observable.just(v)
        .observeOn(Schedulers.computation())
        .map(/* heavy parallel computation */))
    .subscribe()
Lastly, in some examples with flatMap I see a toBlocking() call after the flatMap, and I'm not sure why it is needed - shouldn't flatMap already perform the serialization for downstream? (E.g. in this example: http://akarnokd.blogspot.com.es/2016/02/flatmap-part-1.html)

There's a good article by Thomas Nield on exactly this case:
RxJava - Maximizing Parallelization
What I personally do in that case is subscribe each inner Observable on Schedulers.io() inside a flatMap with a maximum concurrent calls parameter.
Observable.range(1, 1000) // Whatever has to be processed
    .flatMap(v -> Observable.fromCallable(() -> { /* io calls */ })
        .subscribeOn(Schedulers.io()), Runtime.getRuntime().availableProcessors() + 1)
    .subscribe();
EDIT
As per the suggestion in the comments, it's better to use Schedulers.computation() for CPU-bound work:
Observable.range(1, 1000) // Whatever has to be processed
    .flatMap(v -> Observable.fromCallable(() -> { /* intense calculation */ })
        .subscribeOn(Schedulers.computation()))
    .subscribe();
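A variant that combines both ideas, capping concurrency at the number of cores while keeping the work on the computation scheduler (a sketch; heavyComputation is a placeholder for the real work):
Observable.range(1, 1000)
    .flatMap(v -> Observable.fromCallable(() -> heavyComputation(v)) // heavyComputation is hypothetical
            .subscribeOn(Schedulers.computation()),
        Runtime.getRuntime().availableProcessors()) // at most one in-flight inner Observable per core
    .toBlocking() // only so a standalone main() doesn't exit before the async work completes
    .forEach(System.out::println);
That last point is presumably also why toBlocking() shows up in the linked flatMap examples: flatMap does serialize its output, but the whole chain becomes asynchronous once subscribeOn is involved, so a standalone main() needs something to wait on.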

When does .race or .hyper outperform non-data-parallelized versions?

I have this code:
# Grab Nutrients.csv from https://data.nal.usda.gov/dataset/usda-branded-food-products-database/resource/c929dc84-1516-4ac7-bbb8-c0c191ca8cec
my @nutrients = "/path/to/Nutrients.csv".IO.lines;
for @nutrients.race {
    my @data = $_.split('","');
    .say if @data[2] eq "Protein" and @data[4] > 70 and @data[5] ~~ /^g/;
};
Nutrients.csv is a 174 MB file, with lots of rows. Non-trivial stuff is done on every row, but there's no data dependency. However, this takes circa 54s while the non-race version uses 43 seconds, 20% less. Any idea of why that happens? Is the kind of operation done here still too little for data parallelism to take hold? I have seen it only working with very heavy operations, like checking if something is prime. In that case, any ballpark of how much should be done for every piece of data to make data parallelism worth the while?
Assuming that "outperform" is defined as "using less wallclock":
Short answer: when it does.
Longer answer: when the overhead of batching values, distributing them over multiple threads and collecting the results, plus the actual CPU needed for the work divided by the number of threads, results in a shorter runtime.
Still longer answer: the dispatcher thread needs some CPU to batch up values and hand the work over to a worker thread and then process its result. As long as that amount of CPU is more than the amount of CPU needed to do the work, you will only use one thread (because by the time the dispatcher thread is ready to dispatch, the only worker thread is ready to receive more work). Which means you've made things worse, because the actual work is now still being done by one thread, but you've added a lot of overhead and latency.
So make sure that the amount of work a worker thread needs to do, is big enough so that the dispatcher thread will need to start up another thread for the next piece of work. This can be done by increasing the batch-size. But a bigger batch, also means that the dispatcher thread will need more CPU to create the batch. Which in turn can make the worker thread be ready to receive the next batch, in which case you're back to just having added overhead.
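As a rough back-of-the-envelope model (my own simplification, not something taken from the implementation): with $N$ batches, a dispatch cost of $d$ CPU-seconds per batch, and $w$ CPU-seconds of actual work per batch spread over $t$ worker threads,
$$T_{\text{parallel}} \approx \max\!\left(N \cdot d,\ \frac{N \cdot w}{t}\right), \qquad T_{\text{serial}} \approx N \cdot w,$$
so racing only pays off when $w$ is comfortably larger than $d$, and increasing the batch size helps only as long as the work in a batch grows faster than the cost of assembling it.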
There are still plans to make the batch size adapt itself automatically to the amount of work that a worker thread needs to do. But unfortunately, that will also require quite an extensive reworking of the current implementation of hyper and race. So don't expect that any time soon, and definitely not before the Great Dispatcher Overhaul has landed.
Please have a look at:
Raku .hyper() and .race() example not working
The syntax in your example should be:
my @nutrients = "/path/to/Nutrients.csv".IO.lines;
race for @nutrients.race(batch => 1, degree => 2)
{
    my @data = $_.split('","');
    .say if @data[2] eq "Protein" and @data[4] > 70 and @data[5] ~~ /^g/;
}
The "race" in front of the "for" makes the difference.

At which point should actors' behaviors be split

I'm completely new to Akka. I am having a hard time grasping when I should split what used to be class methods/behavior into akka messages. Many examples show the messages received as one line - println("Howdy").
Let's assume I want to do the following:
Given a predefined set of regular expressions
Given an input stream of sentences from a book. Each message is a sentence.
Perform regular expression on the sentence
Increment count of matches and non-matches for the regular expression
If there is a match, perform an HTTP POST of the sentence.
What is the guideline that Akka experts use in their head to break this up? Would I make each step here a separate message rather than several method calls? In my head, the only things I would use an Akka message for would be #1 (each message) and #6 (the blocking HTTP call). That would make my handling of each sentence actually perform a decent amount of work (but non-blocking work). Would it be similar to when I decide to use async over not using async? Which, to me, is only ever when there is the chance of a blocking operation.
I assume that the state you want to track in your actor is the number of matches per regular expression.
In this case, your initial approach is valid if you don't have a lot of regular expressions. If you have a lot of them, and every sentence goes through every expression, you will perform a lot of work on the actor thread. This work is non-blocking in the sense that there is no I/O, but it actually prevents the progress of other messages sent to this actor, so it is blocking in that sense. Actors are single-threaded, so if you have a lot of incoming sentences the actor's mailbox will start to grow. If you use an unbounded mailbox (which is the default) you'll eventually go OOM.
One solution would be to dispatch the regex matching to a Future. However, you can't share actor state (the match count per regex) with that future, because (in the general case) that would cause race conditions. To work around this, the future's result is sent back to your actor as another message carrying the counts that need to be updated.
import scala.collection.mutable
import scala.concurrent.Future
import scala.util.matching.Regex

import akka.actor.Actor
import akka.pattern.pipe

case class ProcessSentence(s: String)
case class ProcessParseResult(hits: Map[Regex, Int], s: String)
case class Publish(s: String)

class ParseActor extends Actor {
  import context.dispatcher // ExecutionContext used by the futures below

  // Actor state: match count per regex, only ever touched on the actor's own thread.
  val regexHits = mutable.Map("\\s+".r -> 0, "foo*".r -> 0)

  def receive = {
    // Run the matching off the actor thread and pipe the result back as a message.
    case ProcessSentence(s) => parseSentence(s, regexHits.keys.toSeq).pipeTo(self)
    case ProcessParseResult(update, s) =>
      update.foreach { case (regex, count) => regexHits(regex) += count } // update regexHits map
      if (update.values.sum > 0)
        self ! Publish(s)
    case Publish(s) => Future { /* send http request */ }
  }

  def parseSentence(s: String, regexes: Seq[Regex]): Future[ProcessParseResult] =
    Future { ProcessParseResult(regexes.map(r => r -> r.findAllIn(s).size).toMap, s) }
}
The way I recommend approaching designing with Akka is to first identify the state in your problem.
Let's start there:
If you have state, evaluate if an actor is a simpler approach than synchronization and locking
If you don't have state, evaluate if you actually need to use an actor
If you have state, then an actor may be a good fit as it can help ensure you safely handle concurrent access to the state in the actor, and its fault tolerance mechanisms will help your application recover if that state becomes corrupt causing application errors.
In your case, a simple counter or two can be handled with Java's atomic integers, so I would actually recommend that you use a basic class instead of an actor. It will be much simpler than using an actor. If you want to return a future, you can use Java 8's CompletableFuture or Scala's concurrent.Future, and the result will still be simpler than using an actor.
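A minimal sketch of that plain-class alternative (the names here are invented for illustration, not taken from the question):
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

// Hypothetical plain class: no actor needed, the atomic counters make the state safe to share.
class SentenceMatcher {
    private final Pattern pattern;
    private final AtomicInteger matches = new AtomicInteger();
    private final AtomicInteger nonMatches = new AtomicInteger();

    SentenceMatcher(Pattern pattern) { this.pattern = pattern; }

    CompletableFuture<Boolean> process(String sentence) {
        return CompletableFuture.supplyAsync(() -> {
            boolean hit = pattern.matcher(sentence).find();
            (hit ? matches : nonMatches).incrementAndGet();
            if (hit) {
                // perform the HTTP POST of the sentence here
            }
            return hit;
        });
    }

    int matchCount()    { return matches.get(); }
    int nonMatchCount() { return nonMatches.get(); }
}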
With all of that said, if you DO want to use an actor, the design might not warrant multiple actors because you only have one piece of state.
As a general rule, your actors should have only one responsibility - much like good object-oriented design. While you might accept a message in your one single actor, you may still decide to split the logic into multiple classes. So you might have your 'RegexActor', and then a 'MatchBehavior' implemented with the strategy pattern that the RegexActor is passed on construction.
All in all, I don't think you need to build a bunch of actors - actors give you asynchronous boundaries between things, but you will still benefit from using good OO in the behaviours that the actor has. The actor has one message - that of handling lines of the book - but it has a few behaviours that can be composed from several different classes. I would use basic classes for that behaviour and have the actor delegate to those other classes after it receives the message.
You want things to be simple - there is a cost in lost type safety and added code when using actors, so I recommend you make sure there is a good reason for using them - either state that exists in a concurrent environment, or distribution.

thread building block combined with pthreads

I have a queue with elements which need to be processed. I want to process these elements in parallel. There will be some sections of each element's processing which need to be synchronized. At any point in time there can be at most num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q

process_element(e)
{
    lock()
    some synchronized area
    // a matrix access is performed here so a spin lock would do
    unlock()
    ...
    unsynchronized area
    ...
    if( condition )
    {
        new_element = generate_new_element()
        q.push(new_element) // synchronized access to queue
    }
}

process_queue()
{
    while( elements in q ) // algorithm-is-finished condition
    {
        e = get_elem_from_queue(q) // synchronized access to queue
        process_element(e)
    }
}
I can use
pthreads
openmp
intel thread building blocks
Top problems I have
Make sure that at any point in time I have max num_threads running threads
Lightweight synchronization methods to use on queue
My plan is to use the Intel TBB concurrent_queue for the queue container. But then, will I be able to use pthreads functions (mutexes, conditions)? Let's assume this works (it should). Then, how can I use pthreads to have at most num_threads threads at one point in time? I was thinking of creating the threads once, and then, after one element is processed, accessing the queue and getting the next element. However, it is more complicated than that, because I have no guarantee that the algorithm is finished just because there is no element in the queue.
My question
Before I start implementing, I'd like to know if there is an easy way to use Intel TBB or pthreads to obtain the behaviour I want - more precisely, processing elements from a queue in parallel.
Note: I have tried to use tasks but with no success.
First off, pthreads gives you portability which is hard to walk away from. The following appear to be true from your question - let us know if these aren't true because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to a queue. pthreads_mutex_lock and pthreads_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthreads_mutexes for dequeueing elements
Termination is tricky - one way to do this is to add a TERMINATE element to the queue, which upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a named thread dequeue it after all the threads are done.
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.
TBB is compatible with other threading packages.
TBB also emphasizes scalability. So when you port your program from a dual core to a quad core you do not have to adjust your program. With data parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is also another runtime that provides good results.
www.cilkplus.org
Since pthreads is a low-level threading library, you have to decide how much control you need in your application: it does offer flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance costs.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works correctly with an std::queue without any user synchronization (of course you would still need to protect your matrix access inside process_element()). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue (the only caution is that the newly added work will be processed immediately, unlike putting it in a queue, which would postpone processing until after all "older" items). Also, you don't have to worry about termination: parallel_do will complete automatically as soon as all initial queue items and all new items created on the fly have been processed.
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.

How does LMAX's disruptor pattern work?

I am trying to understand the disruptor pattern. I have watched the InfoQ video and tried to read their paper. I understand there is a ring buffer involved, and that it is initialized as an extremely large array to take advantage of cache locality and eliminate allocation of new memory.
It sounds like there are one or more atomic integers which keep track of positions. Each 'event' seems to get a unique id, and its position in the ring is found by taking its modulus with respect to the size of the ring, etc., etc.
Unfortunately, I don't have an intuitive sense of how it works. I have done many trading applications and studied the actor model, looked at SEDA, etc.
In their presentation they mentioned that this pattern is basically how routers work; however I haven't found any good descriptions of how routers work either.
Are there some good pointers to a better explanation?
The Google Code project does reference a technical paper on the implementation of the ring buffer, however it is a bit dry, academic and tough going for someone wanting to learn how it works. However there are some blog posts that have started to explain the internals in a more readable way. There is an explanation of the ring buffer that is the core of the disruptor pattern, a description of the consumer barriers (the part related to reading from the disruptor), and some information on handling multiple producers available.
The simplest description of the Disruptor is: It is a way of sending messages between threads in the most efficient manner possible. It can be used as an alternative to a queue, but it also shares a number of features with SEDA and Actors.
Compared to Queues:
The Disruptor provides the ability to pass a message on to other threads, waking them up if required (similar to a BlockingQueue). However, there are 3 distinct differences.
The user of the Disruptor defines how messages are stored by extending the Entry class and providing a factory to do the preallocation. This allows either memory reuse (copying), or the Entry can contain a reference to another object.
Putting messages into the Disruptor is a 2-phase process: first a slot is claimed in the ring buffer, which provides the user with the Entry that can be filled with the appropriate data; then the entry must be committed. This 2-phase approach is necessary to allow for the flexible use of memory mentioned above, and it is the commit that makes the message visible to the consumer threads (a sketch of what this looks like in code follows this list).
It is the responsibility of the consumer to keep track of the messages that have been consumed from the ring buffer. Moving this responsibility away from the ring buffer itself helped reduce the amount of write contention as each thread maintains its own counter.
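Roughly, the claim-then-commit sequence looks like this from the producer side in more recent Disruptor versions (written from memory, so treat the exact names as approximate; ValueEvent/setValue are placeholder names for whatever event type the factory pre-allocates):
long sequence = ringBuffer.next();           // phase 1: claim the next slot
try {
    ValueEvent event = ringBuffer.get(sequence);
    event.setValue(data);                    // fill the pre-allocated entry in place
} finally {
    ringBuffer.publish(sequence);            // phase 2: commit - only now do consumers see it
}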
Compared to Actors
The Actor model is closer to the Disruptor than most other programming models, especially if you use the BatchConsumer/BatchHandler classes that are provided. These classes hide all of the complexities of maintaining the consumed sequence numbers and provide a set of simple callbacks when important events occur. However, there are a couple of subtle differences.
The Disruptor uses a 1 thread - 1 consumer model, whereas Actors use an N:M model, i.e. you can have as many actors as you like and they will be distributed across a fixed number of threads (generally 1 per core).
The BatchHandler interface provides an additional (and very important) callback, onEndOfBatch(). This allows slow consumers, e.g. those doing I/O, to batch events together to improve throughput. It is possible to do batching in other Actor frameworks; however, as nearly all other frameworks don't provide a callback at the end of the batch, you need to use a timeout to determine the end of the batch, resulting in poor latency.
Compared to SEDA
LMAX built the Disruptor pattern to replace a SEDA based approach.
The main improvement that it provided over SEDA was the ability to do work in parallel. To do this the Disruptor supports multi-casting the same messages (in the same order) to multiple consumers. This avoids the need for fork stages in the pipeline.
We also allow consumers to wait on the results of other consumers without having to put another queuing stage between them. A consumer can simply watch the sequence number of a consumer that it is dependent on. This avoids the need for join stages in the pipeline.
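In the Disruptor DSL that dependency wiring looks roughly like the following (handler names are invented for illustration; the exact API may differ by version):
disruptor.handleEventsWith(journalHandler, replicationHandler) // run in parallel off the same ring
         .then(businessLogicHandler);                          // only sees entries both of the above have processed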
Compared to Memory Barriers
Another way to think about it is as a structured, ordered memory barrier, where the producer barrier forms the write barrier and the consumer barrier forms the read barrier.
First we'd like to understand the programming model it offers.
There are one or more writers. There are one or more readers. There is a line of entries, totally ordered from old to new (pictured as left to right). Writers can add new entries on the right end. Every reader reads entries sequentially from left to right. Readers can't read past writers, obviously.
There is no concept of entry deletion. I use "reader" instead of "consumer" to avoid the image of entries being consumed. However we understand that entries on the left of the last reader become useless.
Generally readers can read concurrently and independently. However, we can declare dependencies among readers. Reader dependencies can form an arbitrary acyclic graph. If reader B depends on reader A, reader B can't read past reader A.
Reader dependency arises because reader A can annotate an entry, and reader B depends on that annotation. For example, A does some calculation on an entry, and stores the result in field a in the entry. A then moves on, and now B can read the entry, including the value of a that A stored. If reader C does not depend on A, C should not attempt to read a.
This is indeed an interesting programming model. Regardless of the performance, the model alone can benefit lots of applications.
Of course, LMAX's main goal is performance. It uses a pre-allocated ring of entries. The ring is large enough, but it's bounded so that the system will not be loaded beyond design capacity. If the ring is full, writer(s) will wait until the slowest readers advance and make room.
Entry objects are pre-allocated and live forever, to reduce garbage collection cost. We don't insert new entry objects or delete old entry objects; instead, a writer asks for a pre-existing entry, populates its fields, and notifies readers. This apparent 2-phase action is conceptually a single atomic action:
setNewEntry(EntryPopulator);
interface EntryPopulator{ void populate(Entry existingEntry); }
Pre-allocating entries also means that adjacent entries (very likely) reside in adjacent memory cells, and because readers read entries sequentially, this is important for utilizing CPU caches.
And a lot of effort goes into avoiding locks, CAS, and even memory barriers (e.g. using a non-volatile sequence variable if there's only one writer).
For developers of readers: different annotating readers should write to different fields, to avoid write contention. (Actually they should write to different cache lines.) An annotating reader should not touch anything that other, non-dependent readers may read. This is why I say these readers annotate entries instead of modifying entries.
Martin Fowler has written an article about LMAX and the disruptor pattern, The LMAX Architecture, which may clarify it further.
I actually took the time to study the actual source, out of sheer curiosity, and the idea behind it is quite simple. The most recent version at the time of writing this post is 3.2.1.
There is a buffer storing pre-allocated events that will hold the data for consumers to read.
The buffer is backed by an array of flags (an integer array) of the same length, which describes the availability of the buffer slots (see further for details). The array is accessed like an AtomicIntegerArray, so for the purposes of this explanation you may as well assume it to be one.
There can be any number of producers. When a producer wants to write to the buffer, a long number is generated (as if calling AtomicLong#getAndIncrement; the Disruptor actually uses its own implementation, but it works in the same manner). Let's call this generated long the producerCallId. In a similar manner, a consumerCallId is generated when a consumer ENDS reading a slot from the buffer. The most recent consumerCallId is accessed.
(If there are many consumers, the call with the lowest id is chosen.)
These ids are then compared, and if the difference between the two is less than the buffer size, the producer is allowed to write.
(If the producerCallId is greater than the recent consumerCallId + bufferSize, it means that the buffer is full, and the producer is forced to busy-wait until a spot becomes available.)
The producer is then assigned a slot in the buffer based on its callId (which is producerCallId modulo bufferSize, but since bufferSize is always a power of 2, a limit enforced on buffer creation, the actual operation used is producerCallId & (bufferSize - 1)). It is then free to modify the event in that slot.
(The actual algorithm is a bit more complicated, involving caching the recent consumerCallId in a separate atomic reference, for optimisation purposes.)
Once the event has been modified, the change is "published": the respective slot in the flag array is filled with the updated flag. The flag value is the number of the loop, i.e. producerCallId divided by bufferSize (again, since bufferSize is a power of 2, the actual operation is a right shift).
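In code, the two derived values look roughly like this (illustrative names only, not the actual Disruptor source):
// bufferSize is a power of two, so & and >>> stand in for % and /.
static int slotIndex(long callId, int bufferSize) {
    return (int) (callId & (bufferSize - 1));      // which slot in the ring
}
static long lapFlag(long callId, int indexShift) { // indexShift = log2(bufferSize)
    return callId >>> indexShift;                  // which lap around the ring
}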
In a similar manner there can be any number of consumers. Each time a consumer wants to access the buffer, a consumerCallId is generated (depending on how the consumers were added to the disruptor, the atomic used in id generation may be shared or separate for each of them). This consumerCallId is then compared to the most recent producerCallId, and if it is the lesser of the two, the reader is allowed to progress.
(Similarly, if the producerCallId is equal to the consumerCallId, it means that the buffer is empty and the consumer is forced to wait. The manner of waiting is defined by a WaitStrategy during disruptor creation.)
For individual consumers (the ones with their own id generator), the next thing checked is the ability to batch consume. The slots in the buffer are examined in order, from the one corresponding to the consumerCallId (the index is determined in the same manner as for producers) to the one corresponding to the recent producerCallId.
They are examined in a loop by comparing the flag value written in the flag array against a flag value generated for the consumerCallId. If the flags match, it means that the producers filling those slots have committed their changes. If not, the loop is broken and the highest committed changeId is returned. The slots from the consumerCallId up to the returned changeId can be consumed in a batch.
If a group of consumers read together (the ones with shared id generator), each one only takes a single callId, and only the slot for that single callId is checked and returned.
From this article:
The disruptor pattern is a batching queue backed up by a circular array (i.e. the ring buffer) filled with pre-allocated transfer objects which uses memory-barriers to synchronize producers and consumers through sequences.
Memory-barriers are kind of hard to explain and Trisha's blog has done the best attempt in my opinion with this post: http://mechanitis.blogspot.com/2011/08/dissecting-disruptor-why-its-so-fast.html
But if you don't want to dive into the low-level details you can just know that memory-barriers in Java are implemented through the volatile keyword or through the java.util.concurrent.AtomicLong. The disruptor pattern sequences are AtomicLongs and are communicated back and forth among producers and consumers through memory-barriers instead of locks.
I find it easier to understand a concept through code, so the code below is a simple helloworld from CoralQueue, which is a disruptor pattern implementation done by CoralBlocks with which I am affiliated. In the code below you can see how the disruptor pattern implements batching and how the ring-buffer (i.e. circular array) allows for garbage-free communication between two threads:
package com.coralblocks.coralqueue.sample.queue;
import com.coralblocks.coralqueue.AtomicQueue;
import com.coralblocks.coralqueue.Queue;
import com.coralblocks.coralqueue.util.MutableLong;
public class Sample {
public static void main(String[] args) throws InterruptedException {
final Queue<MutableLong> queue = new AtomicQueue<MutableLong>(1024, MutableLong.class);
Thread consumer = new Thread() {
@Override
public void run() {
boolean running = true;
while(running) {
long avail;
while((avail = queue.availableToPoll()) == 0); // busy spin
for(int i = 0; i < avail; i++) {
MutableLong ml = queue.poll();
if (ml.get() == -1) {
running = false;
} else {
System.out.println(ml.get());
}
}
queue.donePolling();
}
}
};
consumer.start();
MutableLong ml;
for(int i = 0; i < 10; i++) {
while((ml = queue.nextToDispatch()) == null); // busy spin
ml.set(System.nanoTime());
queue.flush();
}
// send a message to stop consumer...
while((ml = queue.nextToDispatch()) == null); // busy spin
ml.set(-1);
queue.flush();
consumer.join(); // wait for the consumer thread to die...
}
}

Overhead due to use of Events

I have a custom thread pool class, that creates some threads that each wait on their own event (signal). When a new job is added to the thread pool, it wakes the first free thread so that it executes the job.
The problem is the following: I have around 1000 loops, each of around 10'000 iterations, to do. These loops must be executed sequentially, but I have 4 CPUs available. What I try to do is split each 10'000-iteration loop into 4 loops of 2'500 iterations, i.e. one per thread. But I have to wait for the 4 small loops to finish before going on to the next "big" iteration. This means that I can't bundle the jobs.
My problem is that using the thread pool and 4 threads is much slower than doing the jobs sequentially (having one loop executed by a separate thread is much slower than executing it directly in the main thread sequentially).
I'm on Windows, so I create events with CreateEvent() and then wait on one of them using WaitForMultipleObjects(2, handles, false, INFINITE) until the main thread calls SetEvent().
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
My question is : is it normal that using events takes "a lot of" time ? If so, is there another mechanism that I could use and that would be less time-expensive ?
Here is some code to illustrate (some relevant parts copied from my thread pool class) :
// thread function
unsigned __stdcall ThreadPool::threadFunction(void* params) {
// some housekeeping
HANDLE signals[2];
signals[0] = waitSignal;
signals[1] = endSignal;
do {
// wait for one of the signals
waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
// try to get the next job parameters;
if (tp->getNextJob(threadId, data)) {
// execute job
void* output = jobFunc(data.params);
// tell thread pool that we're done and collect output
tp->collectOutput(data.ID, output);
}
tp->threadDone(threadId);
}
while (waitResult - WAIT_OBJECT_0 == 0);
// if we reach this point, endSignal was sent, so we are done !
return 0;
}
// create all threads
for (int i = 0; i < nbThreads; ++i) {
threadData data;
unsigned int threadId = 0;
char eventName[20];
sprintf_s(eventName, 20, "WaitSignal_%d", i);
data.handle = (HANDLE) _beginthreadex(NULL, 0, ThreadPool::threadFunction,
this, CREATE_SUSPENDED, &threadId);
data.threadId = threadId;
data.busy = false;
data.waitSignal = CreateEvent(NULL, true, false, eventName);
this->threads[threadId] = data;
// start thread
ResumeThread(data.handle);
}
// add job
void ThreadPool::addJob(int jobId, void* params) {
// housekeeping
EnterCriticalSection(&(this->mutex));
// first, insert parameters in the list
this->jobs.push_back(job);
// then, find the first free thread and wake it
for (it = this->threads.begin(); it != this->threads.end(); ++it) {
thread = (threadData) it->second;
if (!thread.busy) {
this->threads[thread.threadId].busy = true;
++(this->nbActiveThreads);
// wake thread such that it gets the next params and runs them
SetEvent(thread.waitSignal);
break;
}
}
LeaveCriticalSection(&(this->mutex));
}
This looks to me like a producer-consumer pattern, which can be implemented with two semaphores: one guarding against queue overflow, the other signalling a non-empty queue.
You can find some details here.
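For illustration, a minimal bounded-queue sketch of that idea (written in Java just to keep it short; the same two-semaphore structure maps directly onto POSIX or Win32 semaphores):
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

class BoundedQueue<T> {
    private final Queue<T> items = new ArrayDeque<>();
    private final Semaphore free;                    // guards against queue overflow
    private final Semaphore used = new Semaphore(0); // guards against the empty queue
    private final Object lock = new Object();        // protects the underlying queue

    BoundedQueue(int capacity) { free = new Semaphore(capacity); }

    void put(T item) throws InterruptedException {
        free.acquire();                              // wait for a free slot
        synchronized (lock) { items.add(item); }
        used.release();                              // signal one more available item
    }

    T take() throws InterruptedException {
        used.acquire();                              // wait for an available item
        T item;
        synchronized (lock) { item = items.remove(); }
        free.release();                              // signal one more free slot
        return item;
    }
}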
Yes, WaitForMultipleObjects is pretty expensive. If your jobs are small, the synchronization overhead will start to overwhelm the cost of actually doing the job, as you're seeing.
One way to fix this is bundle multiple jobs into one: if you get a "small" job (however you evaluate such things), store it someplace until you have enough small jobs together to make one reasonably-sized job. Then send all of them to a worker thread for processing.
Alternately, instead of using signaling you could use a multiple-reader single-writer queue to store your jobs. In this model, each worker thread tries to grab jobs off the queue. When it finds one, it does the job; if it doesn't, it sleeps for a short period, then wakes up and tries again. This will lower your per-task overhead, but your threads will take up CPU even when there's no work to be done. It all depends on the exact nature of the problem.
Watch out, you are still asking for a next job after the endSignal is emitted.
for( ;; ) {
    // wait for one of the signals
    waitResult = WaitForMultipleObjects(2, signals, false, INFINITE);
    if( waitResult - WAIT_OBJECT_0 != 0 )
        return;
    //....
}
Since you say that it is much slower in parallel than in sequential execution, I assume that your processing time for the internal 2500 loop iterations is tiny (in the few-microsecond range). Then there is not much you can do except review your algorithm to split off larger chunks of processing; OpenMP won't help, and other synchronization techniques won't help either, because they all fundamentally rely on events (spin loops do not qualify).
On the other hand, if the processing time of the 2500 loop iterations is larger than 100 microseconds (on current PCs), you might be running into limitations of the hardware. If your processing uses a lot of memory bandwidth, splitting it across four processors will not give you more bandwidth; it will actually give you less because of collisions. You could also be running into problems of cache cycling, where each of your top 1000 iterations will flush and reload the cache of the 4 cores. Then there is no single solution, and depending on your target hardware, there may be none.
If you are just parallelizing loops and using VS 2008, I'd suggest looking at OpenMP. If you're using Visual Studio 2010 Beta 1, I'd suggest looking at the Parallel Patterns Library, particularly the "parallel for" / "parallel for each" APIs or the "task group" class, because these will likely do what you're attempting to do, only with less code.
Regarding your question about performance, it really depends. You'll need to look at how much work you're scheduling during your iterations and what the costs are. WaitForMultipleObjects can be quite expensive if you hit it a lot and your work is small, which is why I suggest using an implementation that is already built. You also need to ensure that you aren't running in debug mode or under a debugger, that the tasks themselves aren't blocking on a lock, I/O or memory allocation, and that you aren't hitting false sharing. Each of these has the potential to destroy scalability.
I'd suggest looking at this under a profiler like xperf, the F1 profiler in Visual Studio 2010 Beta 1 (it has 2 new concurrency modes which help to see contention), or Intel's VTune.
You could also share the code that you're running in the tasks, so folks could get a better idea of what you're doing, because the answer I always get with performance issues is first "it depends" and second, "have you profiled it."
Good Luck
-Rick
It shouldn't be that expensive, but if your job takes hardly any time at all, then the overhead of the threads and sync objects will become significant. Thread pools like this work much better for longer-processing jobs or for those that use a lot of IO instead of CPU resources. If you are CPU-bound when processing a job, ensure you only have 1 thread per CPU.
There may be other issues, how does getNextJob get its data to process? If there's a large amount of data copying, then you've increased your overhead significantly again.
I would optimise it by letting each thread keep pulling jobs off the queue until the queue is empty. That way, you can pass a hundred jobs to the thread pool and the sync objects will be used just once, to kick off the thread. I'd also store the jobs in a queue and pass a pointer, reference or iterator to them to the thread instead of copying the data.
The context switching between threads can be expensive too. It is interesting in some cases to develop a framework you can use to process your jobs sequentially with one thread or with multiple threads. This way you can have the best of the two worlds.
By the way, what is your question exactly? I will be able to answer more precisely with a more precise question :)
EDIT:
The events part can consume more time than your processing in some cases, but it should not be that expensive, unless your processing is really fast. In that case, switching between threads is expensive too, hence the first part of my answer on doing things sequentially...
You should look for inter-thread synchronisation bottlenecks. You can trace the threads' waiting times to begin with...
EDIT: After more hints ...
If I guess correctly, your problem is to efficiently use all your computer's cores/processors to parallelize some processing that is essentially sequential.
Say you have 4 cores and 10000 loops to compute, as in your example (in a comment). You said that you need to wait for the 4 threads to end before going on. Then you can simplify your synchronisation process: you just need to give your four threads the nth, nth+1, nth+2 and nth+3 loops, wait for the four threads to complete, and then go on. You should use a rendezvous or barrier (a synchronization mechanism that waits for n threads to complete). Boost has such a mechanism. You can look at the Windows implementation for efficiency. Your thread pool is not really suited to the task; the search for an available thread inside a critical section is what is killing your CPU time, not the event part.
It appears that this whole event thing (along with the synchronization between the threads using critical sections) is pretty expensive !
"Expensive" is a relative term. Are jets expensive? Are cars? or bicycles... shoes...?
In this case, the question is: are events "expensive" relative to the time taken for JobFunction to execute? It would help to publish some absolute figures: How long does the process take when "unthreaded"? Is it months, or a few femtoseconds?
What happens to the time as you increase the threadpool size? Try a pool size of 1, then 2 then 4, etc.
Also, as you've had some issues with thread pools here in the past, I'd suggest adding some debug code to count the number of times your thread function is actually invoked... does it match what you expect?
Picking a figure out of the air (without knowing anything about your target system, and assuming you're not doing anything 'huge' in code you haven't shown), I'd expect the "event overhead" of each "job" to be measured in microseconds. Maybe a hundred or so. If the time taken to perform the algorithm in JobFunction is not significantly MORE than this time, then your threads are likely to cost you time rather than save it.
As mentioned previously, the amount of overhead added by threading depends on the relative amount of time taken to do the "jobs" that you defined. So it is important to find a balance in the size of the work chunks that minimizes the number of pieces but does not leave processors idle waiting for the last group of computations to complete.
Your coding approach has increased the amount of overhead work by actively looking for an idle thread to supply with new work. The operating system is already keeping track of that, and it does so a lot more efficiently. Also, your function ThreadPool::addJob() may find that all of the threads are in use and be unable to delegate the work, but it does not provide any return code related to that issue. If you are not checking for this condition in some way and are not noticing errors in the results, it means that there are always idle processors. I would suggest reorganizing the code so that addJob() does what its name says -- adds a job ONLY (without finding or even caring who does the job) -- while each worker thread actively gets new work when it is done with its existing work.