TBB for a workload that keeps changing? - c++

I'm sorry, but I don't seem to get Intel's TBB. It seems great and well supported, but I can't wrap my head around how to use it, since I guess I'm not used to thinking of parallelism in terms of tasks; I've always seen it in terms of threads.
My current workload has a job that sends work to a queue to keep processing (think of a recursion, but instead of calling itself it sends work to a queue). The way I got this working in Java was to create a concurrent (non-blocking) queue and a ThreadPoolExecutor that worked the queue and sent work back into it. Now I'm trying to do something similar in C++. I found that TBB can create pools, but its approach is very different: Java threads seem to just keep working as long as there is work in the queue, whereas TBB seems to break the task down at the beginning.
Here's a simple Java example of what I do (before this I set how many threads I want, etc.):
static class DoWork implements Callable<Void> {
    // queue with contexts to process
    // (assumes java.util.Queue, java.util.ArrayDeque and java.util.concurrent.Callable are imported)
    private final Queue<Context> contexts;

    DoWork(Context request) {
        contexts = new ArrayDeque<>();
        contexts.add(request);
    }

    public Void call() {
        while (!contexts.isEmpty()) {
            Context current = contexts.poll();
            // do work on 'current'...
            // if it needs to be sent back to the queue for more work:
            // contexts.add(new Context(data));
        }
        return null;
    }
}
I'm sure it's possible to do this in TBB, but I'm just not sure how, because it seems to break up my work at the time I submit it. So if there are 2 items in the queue it may only launch 2 threads, but it won't grow as more work comes in (even if I have 8 cores).
Can someone help me understand how to achieve this, and maybe suggest a better way to think about TBB when coming from Java's threading environment? (I have no allegiance to TBB, so if there's something easier or better then I'm happy to learn it. I just don't like the C++ threadpool library because it doesn't seem actively developed.)

The approach based on having a queue of items for parallel processing, where each thread just pops one item from the queue and proceeds (and possibly adds a new item to the end of the queue at some point), is fundamentally flawed because it limits the parallelism of the application. The queue becomes a single point of synchronization, and threads need to wait to get access to the next item to process. In practice this approach works when tasks (each item's processing job) are quite large and take different times to complete, which keeps the queue less contended, as opposed to when (most of) the threads finish at the same time and come to the queue for their next items.
If you're writing a somewhat reusable piece of code, you cannot guarantee that tasks are large enough or that they vary in size (time to execute).
I assume that your application scales, which means that you start with some significant number of items (much larger than the number of threads) in your queue, and while threads do the processing they add enough tasks to the end that there's enough work for everyone until the application finishes.
If that's the case, I would rather suggest that you keep two thread-safe vectors of your items (TBB's concurrent_vector, for instance) and alternate between them. You start with one vector (your initial set of items) and enqueue() a task (I think it's described somewhere in chapter 12 of the TBB reference manual) which executes a parallel_for over the initial vector of items. While the first batch is being processed, you push_back the new items onto the second concurrent_vector; when you're done with the first one, you enqueue() a task with a parallel_for over the second vector and start pushing new items back into the first one. You can try to overlap parallel processing of items better by having three vectors instead of two and gradually moving between them while there's still enough work to keep all threads busy.
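To make the alternating-vector idea a bit more concrete, here is a minimal sketch. The Item type, the process() function and the pattern of pushing follow-up items are placeholders of my own, and it collapses the enqueue()d tasks into a plain loop that simply alternates between the two vectors:

#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/concurrent_vector.h>
#include <tbb/parallel_for.h>

struct Item { int data; };        // hypothetical work item
void process(const Item& item);   // hypothetical per-item work

void run_batches(tbb::concurrent_vector<Item>& current,
                 tbb::concurrent_vector<Item>& next)
{
    while (!current.empty()) {
        // Process the current batch in parallel; any items created while
        // processing are push_back'ed onto 'next' by the loop body.
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, current.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) {
                    process(current[i]);
                    // e.g. next.push_back(Item{current[i].data + 1});
                }
            });
        current.clear();
        current.swap(next);   // the freshly filled vector becomes the next batch
    }
}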

What you're trying to do is exactly the sort of thing TBB's parallel_do is designed for. The "body" invoked by parallel_do is passed a "feeder" argument, on which you can call feeder.add(...some new task...) while processing tasks, to create new tasks for execution before the current parallel_do completes.
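A minimal sketch of what that can look like (Work and the follow-up condition are invented for illustration; this assumes a TBB version that still ships parallel_do, since newer oneTBB releases expose the same feeder mechanism through parallel_for_each):

#include <vector>
#include <tbb/parallel_do.h>

struct Work { int data; };   // hypothetical work item

void run(const std::vector<Work>& initial)
{
    tbb::parallel_do(initial.begin(), initial.end(),
        [](const Work& w, tbb::parallel_do_feeder<Work>& feeder) {
            // ... process w ...
            bool needs_more = (w.data > 0);       // invented condition
            if (needs_more)
                feeder.add(Work{w.data - 1});     // processed before parallel_do returns
        });
}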

Related

What is the most efficient way to coordinate between threads about which threads are free?

I'm working on a program that sends messages between threads. It looks at which threads are busy; if one is free it grabs the first free one (or in some cases multiple free ones), marks it as taken, sends work to it and does its own work, then once finished waits for it to complete. The bottleneck in all of this is coordinating between threads about which thread is taken. This seems like a problem others must have encountered. I have some solutions to share, but I also want to know if you can do better than me.
My solution ultimately boils down to:
Maintain a set representing the indexes of free threads, and be able to grab an item from the set (getting the index of a free thread) or add one back to the set (increasing its size by one). Order is unimportant. I know the fixed size of the set in advance.
I've tried a few ways of doing this:
Maintain a single unsigned long long int and use '__builtin_clz' (interestingly, __builtin_ffsll was 10x slower; I suspect it isn't supported by a single instruction on my processor) to find a set bit in a single instruction and grab the lowest one, using a lookup table of bitmasks to flip bits on and off and simultaneously claim that thread number (a rough sketch appears a little further down). I loved this version because I only needed to share a single atomic unsigned long long and could use a single atomic operation, but doing 'fetch_and' in a loop until you succeed ended up being slower than locking and doing it non-atomically. The locking version ended up faster, probably because threads didn't get stuck in loops repeating the same operations while waiting for others to finish theirs.
Use a linked list: allocate all nodes in advance, maintain a head node and a list, and if it points to nullptr we've reached the end of the list. I've only done this with a lock, because it needs two simultaneous operations.
Maintain an array that represents all indexes of threads to claim. Either increment an array index and return the previous position to claim a thread, or swap the last taken entry with the one being freed and decrement the index. Check if any are free.
Use the moodycamel queue, which is a lock-free queue.
Happy to share the full C++ code; the answer was getting quite long when I tried to include it all.
All three are fast. __builtin_clzll is not universally supported, so even though it's a little faster, that's probably not enough to be worth it, and it's probably 10x slower on computers that don't natively support it, similar to how __builtin_ffsll was slow. The array and linked list are roughly as fast as each other, with the array slightly faster when there's no contention. Moodycamel is 3x slower.
Think you can do better, with a faster way to do this? This is still the slowest part of the process, and in some cases it's only just barely worth the cost.
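For concreteness, a stripped-down sketch of the bitmask idea (the first approach above) might look roughly like the following; it uses a compare-exchange loop and __builtin_ctzll to pick the lowest set bit, so the details differ a little from my fetch_and version, and it assumes at most 64 threads:

#include <atomic>
#include <cstdint>

class FreeSet {
    std::atomic<std::uint64_t> bits_;   // bit i set => thread i is free
public:
    explicit FreeSet(unsigned n)
        : bits_(n == 64 ? ~0ull : (1ull << n) - 1) {}

    // Claim the lowest-numbered free thread; returns -1 if none are free.
    int claim() {
        std::uint64_t cur = bits_.load(std::memory_order_relaxed);
        while (cur != 0) {
            int idx = __builtin_ctzll(cur);               // lowest set bit
            std::uint64_t desired = cur & ~(1ull << idx);
            if (bits_.compare_exchange_weak(cur, desired,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed))
                return idx;
            // on failure 'cur' was reloaded; retry with the fresh value
        }
        return -1;
    }

    // Mark a thread as free again.
    void release(int idx) {
        bits_.fetch_or(1ull << idx, std::memory_order_release);
    }
};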
Thoughts for directions to explore:
It feels like there should be a way using a couple of atomics, maybe an array of atomics used one at a time; the integrity of the set has to be maintained with every operation though, which makes this tricky. Most solutions at some point need two operations to be done simultaneously, but in my benchmarking atomics seem like they could provide a significantly faster solution than locking.
Might be able to use a lock but remove the need to check whether the list is empty or to swap elements in the array.
Maybe use a different data structure: for example, two arrays, adding to one while emptying the other, then switching which one is being filled and which is being emptied. That means no need to swap elements, just swapping two pointers to arrays, and only when one is empty.
Could have the threads that launch threads add work to a list of work to be done, so another thread can grab it while this thread keeps going. Ultimately this still needs a similar thread-safe set.
See if the brilliant people on stackoverflow see directions to explore that I haven't seen yet :)
All you need is a thread pool, a queue (a list, deque or a ring buffer), a mutex and a condition_variable to signal when a new work item has been added to the queue.
Wrap work items in packaged_task if you need to wait on the result of each task.
When adding a new work item to the queue, 1) lock the mutex, 2) add the item, 3) release the lock and 4) call notify_one on the condition variable, which will unblock the first available thread.
Once the basic setup is working, if tasks are too fine-grained, work stealing can be added to the solution to improve performance. It's also important to use the main thread to perform part of the work instead of just waiting for all tasks to complete. These simple (albeit clunky) optimizations often result in >15% overall improvement in performance due to reduced context switching.
Also don't forget to think about false sharing. It might be a good idea to pad task items to 64 bytes, just to be on the safe side.
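A bare-bones sketch of that setup might look like the following (no work stealing, no packaged_task, and the class and member names are only illustrative):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;

public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m_);
                        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                        if (done_ && tasks_.empty()) return;
                        task = std::move(tasks_.front());
                        tasks_.pop();
                    }
                    task();   // run the work item outside the lock
                }
            });
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);   // 1) lock, 2) add the item
            tasks_.push(std::move(task));
        }                                           // 3) release the lock
        cv_.notify_one();                           // 4) wake one waiting thread
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
};

Tasks are submitted here as plain std::function<void()>; wrapping them in std::packaged_task and returning the future is a straightforward extension if you need to wait on results.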

C++ synced stages multi thread pipeline

Sorry if this has been asked, I did my best to search for a dup before asking...
I am implementing a video processing application that is supposed to run real time. The processing it does can be easily divided into 4 stages, each one operating on the intermediate values generated from the previous stage. The processing has become heavier than what can be processed in 1/30th of a second, but if I can split this application in 4 threads and turn it into a pipeline, each stage takes less than that and the whole thing would run realtime (with a 4 frame lag, which is completely acceptable).
I'm fairly new to multithreaded programming, and the problem I'm having is that I can't find a mechanism to start/stop each thread at the beginning of each frame, so that they all march together, delivering one finished frame every "cycle" at the end. All the frameworks/libraries I found seem to worry about load balancing using queues and worker threads, but that is not what I need here. Four threads will do, assuming I can keep them synced.
Can anybody point me to a starting point, using C++?
Thanks.
Assuming a 4-frame lag is acceptable, you could use a pool of list nodes, each with a pointer to a frame buffer and a pointer to the intermediate values (a NULL pointer could be used to indicate end of stream). Each thread would have its own list as part of a multithreaded messaging system. The first thread would get a frame node from a free pool, do its processing and send the node to the next thread's list, and so on, with the last thread returning nodes back to the free pool.
Here is a link to an example file copy program that spawns a thread to do the writes. It uses Windows threading, mutexes, and semaphores in the messaging functions, but the messaging functions are simple and could be changed internally to use generic equivalents without changing their interface. The main() function could be changed to use generic threading and setup of the mutexes and semaphores or something equivalent.
mtcopy.zip
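As a rough, platform-neutral sketch of those per-stage message lists (using standard C++ primitives instead of the Windows mutexes and semaphores from the linked example; Frame and the stage functions are placeholders):

#include <condition_variable>
#include <mutex>
#include <queue>

struct Frame;   // holds the frame buffer and the intermediate values

class StageQueue {
    std::queue<Frame*> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(Frame* f) {                  // a null pointer can signal end of stream
        { std::lock_guard<std::mutex> lock(m_); q_.push(f); }
        cv_.notify_one();
    }
    Frame* receive() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Frame* f = q_.front();
        q_.pop();
        return f;
    }
};

// Each stage thread then runs a loop along these lines:
//   for (Frame* f = in.receive(); f != nullptr; f = in.receive()) {
//       run_stage(*f);     // this stage's share of the processing
//       out.send(f);
//   }
//   out.send(nullptr);     // propagate end of stream
// with the last stage returning frames to the free-pool queue instead of 'out'.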

Find minimum queue size among threads

I am trying to implement a new scheduling technique with multiple threads. Each thread has its own private local queue. The idea is that each time a task is created by the program thread, it should find the queue with the minimum size (the queue with the fewest tasks) among all the queues and enqueue the task there.
It's a way of load balancing among threads, where the less busy queues get more work enqueued.
Can you please suggest some logic or ideas for how to find the minimum-size queue among the given queues dynamically, from a programming point of view?
I am working in Visual Studio 2008 with C++, on our own multithreading library implementing a multi-rate synchronous dataflow paradigm.
As you can see, trying to find the least loaded queue is cumbersome and could be inefficient: you may add more work to a queue that holds only one heavy task, whereas queues with small tasks will receive no more jobs and quickly become inactive.
You'd be better off using a work-stealing heuristic: when a thread is done with its own jobs, it looks at the other threads' queues and "steals" some work instead of remaining idle or terminating.
Then the system will be auto-balanced with each thread being active until there is not enough work for everyone.
You should not have a situation with idle threads and work waiting for processing.
If you really want to try this, can each queue not just keep a public 'int count' member, updated with atomic inc/dec as tasks are pushed/popped?
Whether such a design is worth the management overhead and the occasional 'mistakes' when a task is queued to a thread that happens to be running a particularly lengthy job when another thread is just about to dequeue a very short job, is another issue.
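A small sketch of that counter idea, with the producer picking the least loaded queue (WorkerQueue, push and on_pop are placeholder names; the real enqueue/dequeue machinery is whatever the library already provides):

#include <atomic>
#include <cstddef>
#include <vector>

struct WorkerQueue {
    std::atomic<int> count{0};
    // ... the actual task queue lives here ...
    void push(/* Task t */) { count.fetch_add(1, std::memory_order_relaxed); /* enqueue t */ }
    void on_pop()           { count.fetch_sub(1, std::memory_order_relaxed); }
};

// Returns the index of the queue with the smallest count (assumes 'queues' is non-empty).
std::size_t least_loaded(const std::vector<WorkerQueue>& queues) {
    std::size_t best = 0;
    int best_count = queues[0].count.load(std::memory_order_relaxed);
    for (std::size_t i = 1; i < queues.size(); ++i) {
        int c = queues[i].count.load(std::memory_order_relaxed);
        if (c < best_count) { best = i; best_count = c; }
    }
    return best;   // the snapshot may already be stale; see the caveat above
}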
Why aren't the threads fetching their work from a 'master' work queue ?
If you are really trying to distribute work items from a master source to a set of workers, then you are doing load balancing, as you say. In that case you really are talking about scheduling, unless you simply do round-robin style balancing. Scheduling is a very deep subject in computing; you can easily spend weeks, or months, learning about it.
You could synchronise a counter among the threads. But I guess this isn't what you want.
Since you want to implement everything using dataflow, everything should be queues.
Your first option is to query the number of jobs inside a queue. I think this is not easy if you want a single reader/writer pattern, because you probably have to use a lock for this operation, which is not what you want. Note: I'm just guessing that you can't use lock-free queues here; whether you keep a counter or take the difference of two pointers, either way you have a lock.
Your second option (which can be done with lock-free code) is to send a command back to the dispatcher thread, telling it that worker thread x has consumed a job. Using this approach you have n more queues, each going from one worker thread to the dispatcher thread.

How can I improve my real-time behavior in multi-threaded app using pthreads and condition variables?

I have a multi-threaded application that is using pthreads. I have a mutex lock and condition variables. There are two threads: one thread produces data for the second thread, a worker, which tries to process the produced data in a real-time fashion such that one chunk is processed as close to the elapsing of a fixed time period as possible.
This works pretty well; however, occasionally when the producer thread releases the condition upon which the worker is waiting, a delay of up to almost a whole second is seen before the worker thread gets control and executes again.
I know this because right before the producer releases the condition upon which the worker is waiting, it does a chunk of processing for the worker if it is time to process another chunk; then, immediately upon receiving the condition in the worker thread, it also does a chunk of processing if it is time to process another chunk.
In this latter case, I am seeing that I am late processing the chunk many times. I'd like to eliminate this lost efficiency and do what I can to keep the chunks ticking away as close as possible to the desired frequency.
Is there anything I can do to reduce the delay between the release condition from the producer and the detection that that condition is released such that the worker resumes processing? For example, would it help for the producer to call something to force itself to be context switched out?
Bottom line: the worker has to wait each time it asks the producer to create work for it, so that the producer can muck with the worker's data structures before telling the worker it is ready to run in parallel again. This period of exclusive access by the producer is meant to be short, but during this period I am also checking for real-time work to be done by the producer on behalf of the worker while the producer has exclusive access. Somehow my hand-off back to running in parallel again occasionally results in a significant delay that I would like to avoid. Please suggest how this might best be accomplished.
I could suggest the following pattern. Generally the same technique could be used, e.g. when prebuffering frames in some real-time renderers or something like that.
First, it's obvious that the approach you describe in your message would only be effective if both of your threads are loaded equally (or almost equally) all the time. If not, multi-threading wouldn't actually be of much benefit in your situation.
Now, let's think about a thread pattern that would be optimal for your problem. Assume we have a yielding thread and a processing thread. The first of them prepares chunks of data to process; the second does the processing and stores the result somewhere (where exactly is not important).
The effective way to make these threads work together is the proper yielding mechanism. Your yielding thread should simply add data to some shared buffer and shouldn't actually care about what would happen with that data. And, well, your buffer could be implemented as a simple FIFO queue. This means that your yielding thread should prepare data to process and make a PUSH call to your queue:
X = PREPARE_DATA()
BUFFER.LOCK()
BUFFER.PUSH(X)
BUFFER.UNLOCK()
Now, the processing thread. Its behaviour could be described this way (you should probably add some artificial delay, like SLEEP(X), between calls to EMPTY):
IF !EMPTY(BUFFER) PROCESS(BUFFER.TOP)
The important point here is what your processing thread should do with the processed data. The obvious approach is to make a POP call after the data is processed, but you may want to come up with a better idea. Anyway, in my variant it would look like:
// After data is processed
BUFFER.LOCK()
BUFFER.POP()
BUFFER.UNLOCK()
Note that locking operations in yielding and processing threads shouldn't actually impact your performance because they are only called once per chunk of data.
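Since the question is pthread-based, a minimal sketch of that shared buffer with a pthread mutex and condition variable (waiting on the condition variable instead of the SLEEP-based polling; Chunk is a placeholder type) could look like this:

#include <pthread.h>
#include <queue>

struct Chunk { /* data prepared by the yielding thread for the worker */ };

class ChunkBuffer {
    std::queue<Chunk> q_;
    pthread_mutex_t m_;
    pthread_cond_t cv_;
public:
    ChunkBuffer()  { pthread_mutex_init(&m_, nullptr); pthread_cond_init(&cv_, nullptr); }
    ~ChunkBuffer() { pthread_cond_destroy(&cv_); pthread_mutex_destroy(&m_); }

    void push(const Chunk& c) {            // called by the yielding thread
        pthread_mutex_lock(&m_);
        q_.push(c);
        pthread_mutex_unlock(&m_);
        pthread_cond_signal(&cv_);         // wake the processing thread if it is waiting
    }

    Chunk pop() {                          // called by the processing thread
        pthread_mutex_lock(&m_);
        while (q_.empty())
            pthread_cond_wait(&cv_, &m_);  // atomically releases the mutex while waiting
        Chunk c = q_.front();
        q_.pop();
        pthread_mutex_unlock(&m_);
        return c;
    }
};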
Now, the interesting part. As I wrote at the beginning, this approach is only effective if the threads behave roughly the same in terms of CPU/resource usage. There is a way to make this threading solution effective even if that condition does not always hold and depends on other runtime conditions.
This way means creating another thread, called the controller thread. This thread would merely compare the time that each thread uses to process one chunk of data and balance the thread priorities accordingly. Actually, we don't even have to "compare the time"; the controller thread could simply work like this:
IF BUFFER.SIZE() > T
    DECREASE_PRIORITY(YIELDING_THREAD)
    INCREASE_PRIORITY(PROCESSING_THREAD)
Of course, you could implement some better heuristics here but the approach with controller thread should be clear.

How do I tell a multi-core / multi-CPU machine to process function calls in a loop in parallel?

I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Provided I have a loop that iterates over a large number of data chunks arriving from the db, and for each one calls a deterministic function without side effects, how would I make it so that the program does not wait for the function to return but rather sets the next calls going, so they can be processed in parallel? A naive approach to demonstrate the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process and so far I am targeting C-something (for the speed critical bits) and Python (for the productivity critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am contented with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one and the whole process would then be I/O-bound. I would, however, assume for now that it isn't and in practice use a db cluster or memory caching or something else to be not I/O-bound at this point.
Well, if .net is an option, they have put a lot of effort into Parallel Computing.
If you still plan on using Python, you might want to have a look at Processing. It uses processes rather than threads for parallel computing (due to the Python GIL) and provides classes for distributing "work items" onto several processes. Using the pool class, you can write code like the following:
import processing

def worker(i):
    return i * i

num_workers = 2
pool = processing.Pool(num_workers)
result = pool.imap(worker, range(100000))
This is a parallel version of itertools.imap, which distributes calls over to processes. You can also use the apply_async methods of the pool and store lazy result objects in a list:
results = []
for i in range(10000):
    results.append(pool.apply_async(worker, (i,)))
For further reference, see the documentation of the Pool class.
Gotchas:
processing uses fork(), so you have to be careful on Win32
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e. the number of work items sent to a worker process in one batch
processing.Pool uses a background thread
You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.
I might be missing something here, but this seems fairly straightforward using pthreads.
Set up a small threadpool with N threads in it and have one thread to control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find the next free thread; if no thread is free then wait
Hand over chunk to worker thread
Go back and get next chunk from DB
In the meantime the worker threads sit and do:
Mark myself as free
Wait for the master thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The method by which you implement this can be as simple as two mutex-controlled arrays. One holds the worker threads (the thread pool) and the other indicates whether each corresponding thread is free or busy.
Tweak N to your liking ...
If you're working with a compiler that supports it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code such that certain loops will be parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc 4.2 will support OpenMP, for example.
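To illustrate the style of annotation (a sketch only; Chunk and process_chunk are invented names, and you'd compile with an OpenMP flag such as -fopenmp):

#include <vector>

struct Chunk { /* one chunk of data from the DB */ };
void process_chunk(Chunk& c);   // the deterministic, side-effect-free function

void process_all(std::vector<Chunk>& chunks)
{
    // Each iteration is independent, so OpenMP can distribute the
    // iterations of this loop across the available cores.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(chunks.size()); ++i)
        process_chunk(chunks[i]);
}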
A similar thread pool approach is used in Java, except that there the tasks in the thread pools can be serialised, sent to other computers, and deserialised to run.
I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library, and the user just has to implement Map and Reduce. It is positioned as a Boost library, but not yet accepted as a formal lib. Check out http://www.craighenderson.co.uk/mapreduce
You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.
Intel's TBB or boost::mpi might be of interest to you also.