Context
I have a long running process returning a result. This result may need further processing. There can also be multiple results ready simultaneously. Each one of these results that needs further processing may take a while too.
I need to take each one of those results, throw each into a queue of sorts and process each one, with the ability to start/stop this process.
Problem:
At the moment, I am limited to using a QFuture with a QFutureWatcher and either the QtConcurrent::mapped() or the QtConcurrent::run() function (the latter of which doesn't support direct pause/stop functionality, see note below).
The problem with the approach(es) mentioned above is that they require all results to be known up front. To decrease the overall processing time, however, I would like to process each result as it comes in.
How can I effectively create a thread pool with a queue of tasks?
Related
Suppose I have a multi-threaded program in C++11, in which each thread controls the behavior of something displayed to the user.
I want to ensure that for every time period T during which one of the threads of the given program has run, each thread gets a chance to execute for at least time t, so that the display looks as if all threads are executing simultaneously. The idea is to have a mechanism for round-robin scheduling with time sharing based on some information stored in the thread, forcing a thread to wait after its time slice is over, instead of relying on the operating system scheduler.
Preferably, I would also like to ensure that each thread is scheduled in real time.
In case there is no way other than relying on the operating system, is there any solution for Linux?
Is it possible to do this? How?
No, that's not possible in a cross-platform way with C++11 threads. How often and for how long a thread runs isn't up to the application. It's up to the operating system you're using.
However, there are functions with which you can tell the OS that a particular thread or process is especially important, so you can influence how much CPU time it gets for your purposes.
You can acquire the platform-dependent thread handle to use OS functions.
native_handle_type std::thread::native_handle //(since C++11)
Returns the implementation defined underlying thread handle.
I just want to stress again that this requires an implementation which is different for each platform!
Microsoft Windows
According to the Microsoft documentation:
SetThreadPriority function
Sets the priority value for the specified thread. This value, together
with the priority class of the thread's process determines the
thread's base priority level.
Linux/Unix
For Linux things are more difficult because there are several different ways in which threads can be scheduled. Microsoft Windows uses a priority system, but on Linux this doesn't seem to be the default scheduling policy.
For more information, please take a look at this Stack Overflow question (it should be the same for std::thread because of this).
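As a rough illustration of the native-handle route on Linux, the sketch below reads a thread's current scheduling policy through pthreads. It assumes a POSIX system where std::thread::native_handle() returns a pthread_t (true for common Linux toolchains, but implementation defined). The helper names query_policy and demo_query_policy are made up for this example. Changing the policy with pthread_setschedparam to a real-time one such as SCHED_FIFO or SCHED_RR follows the same pattern, but it typically needs elevated privileges, so only the query is shown.

```cpp
#include <cassert>
#include <future>
#include <pthread.h>  // POSIX only
#include <thread>

// Hypothetical helper: reads the scheduling policy of a running std::thread
// via its native handle. Returns the pthread error code (0 on success).
int query_policy(std::thread& t, int& policy_out) {
    assert(t.joinable());  // the thread must not have been joined or detached
    sched_param param{};
    return pthread_getschedparam(t.native_handle(), &policy_out, &param);
}

// Keeps a worker alive while its policy is queried; true if the query worked.
bool demo_query_policy() {
    std::promise<void> release;
    std::future<void> gate = release.get_future();
    std::thread worker([&gate] { gate.wait(); });  // blocks until released
    int policy = -1;
    const int rc = query_policy(worker, policy);
    release.set_value();  // let the worker finish
    worker.join();
    return rc == 0;
}
```

The worker is held on a future so that the handle is still valid when it is queried.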
I want to ensure that for every time period T during which one of the threads of the given program has run, each thread gets a chance to execute for at least time t, so that the display looks as if all threads are executing simultaneously.
You are using threads to make it seem as though different tasks are executing simultaneously. That is not recommended for the reasons stated in Arthur's answer, to which I really can't add anything.
If, instead of having long-lived threads each doing its own task, you can express the work as tasks that can be executed without mutual exclusion, you can have a single queue of tasks and a thread pool dequeuing and executing them.
If you cannot, you might want to look into wait-free data structures and algorithms. In a wait-free algorithm/data structure, every thread is guaranteed to complete its work in a finite (and even specified) number of steps. I can recommend the book The Art of Multiprocessor Programming, where this topic is discussed at length. The gist of it is: every lock-free algorithm/data structure can be modified to be wait-free by adding communication between threads, through which a thread that's about to do work makes sure that no other thread is starved/stalled. Basically, you prefer fairness over total throughput of all threads. In my experience this is usually not a good compromise.
I have used std::async for speeding up the execution of a task, which was previously being executed sequentially.
My implementation does the following:
Launches a pre-configured number of tasks (e.g. at most 10 concurrent tasks).
The futures for these tasks are stored in a vector.
As soon as one task is finished, it launches another task, so that at any point in time at most 10 tasks (this value is configurable) are running.
After launching 10 tasks, my implementation waits for the oldest task (i.e. the first task in the vector) to complete, by calling get() on that future.
Though this works fine, any of the 10 tasks could complete first, yet my implementation always waits on the first task in the vector. Is there a way to know which of the 10 tasks completed first?
For example, could the future object itself signal that it is ready?
I want to achieve functionality similar to "WhenAny()" function mentioned in this article: https://msdn.microsoft.com/en-us/library/jj155756.aspx
I think that the equivalent to WhenAny (C#) has not yet been incorporated in the C++11/14 standard, though it is being considered as an experimental future extension (see this).
I think the latest versions of the Boost libraries include when_any, check this out.
This company also sells a complete thread library that includes when_any.
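Without when_any, one workaround that uses only the standard library is to poll each future with a zero timeout. The sketch below is a polling substitute, not an equivalent of when_any; the names wait_for_any and demo_wait_for_any, the 1 ms sleep, and the task delays are all assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Returns the index of the first future found ready, polling in a loop.
// This is a polling workaround, not a true when_any.
template <typename T>
std::size_t wait_for_any(std::vector<std::future<T>>& futures) {
    assert(!futures.empty());
    for (;;) {
        for (std::size_t i = 0; i < futures.size(); ++i) {
            if (futures[i].wait_for(std::chrono::seconds(0)) ==
                std::future_status::ready)
                return i;
        }
        // Back off briefly so the loop does not spin at full speed.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// Hypothetical demo: task 1 sleeps far less than the others.
std::size_t demo_wait_for_any() {
    std::vector<std::future<int>> futures;
    const int delays_ms[] = {300, 10, 300};
    for (int i = 0; i < 3; ++i) {
        const int d = delays_ms[i];
        futures.push_back(std::async(std::launch::async, [i, d] {
            std::this_thread::sleep_for(std::chrono::milliseconds(d));
            return i;
        }));
    }
    return wait_for_any(futures);  // index of the task that finished first
}
```

Once an index is returned, get() can be called on that future and a replacement task launched in its slot. The cost is the polling latency and a little wasted CPU.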
Give them each an ID and use an atomic to store the first that finishes.
Somewhere in scope of all of the functions:
std::atomic<int> first_id(0);
At the completion of each task:
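int expected = 0; // compare_exchange_strong takes the expected value by reference
first_id.compare_exchange_strong(expected, id);
Where id is from 1 to 10. The exchange will succeed only once: the first task to run this replaces the 0, and every later attempt fails because the value is no longer 0.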
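Put together, the idea might look like the following sketch. It uses plain std::thread for brevity rather than std::async, and the function name first_finisher is made up for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Runs n_tasks threads; each tries to claim "first" by swapping 0 for its id.
// Returns the id (1..n_tasks) of whichever task won the race.
int first_finisher(int n_tasks) {
    assert(n_tasks > 0);
    std::atomic<int> first_id(0);
    std::vector<std::thread> tasks;
    for (int id = 1; id <= n_tasks; ++id) {
        tasks.emplace_back([&first_id, id] {
            // ... the task's real work would go here ...
            int expected = 0;  // must be an lvalue; overwritten on failure
            first_id.compare_exchange_strong(expected, id);
        });
    }
    for (auto& t : tasks) t.join();
    return first_id.load();
}
```

Which id wins depends on scheduling, so the result is only guaranteed to be one of the valid ids.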
Edit: The above is the answer to your literal question. But it doesn't really help you do what you want. To do what you want, I'd change the vector to a queue and have each task enqueue the next when it exits (you'll also need a lock to lock the queue before modifying it.) Alternatively, you could use a threadpool. (Shameless plug: here's mine.) A thread pool would let you enqueue all the tasks, but only use n threads, which will prevent overtaxing the scheduler while still making the coding simple.
Use a thread pool. Queue all of your tasks. Have them report to an atomic counter when done, and have them signal a condition variable when the counter says all are done.
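A minimal sketch of such a pool follows, using only the standard library. The class name ThreadPool and its members are illustrative and not taken from any particular library. One deliberate difference from the description above: the pending-task counter is protected by the pool's mutex instead of being a std::atomic, which keeps the condition-variable wait simple.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n_threads) {
        assert(n_threads > 0);
        for (std::size_t i = 0; i < n_threads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        work_cv_.notify_all();
        for (auto& w : workers_) w.join();  // remaining tasks are drained first
    }
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
            ++pending_;
        }
        work_cv_.notify_one();
    }
    // Blocks until every task enqueued so far has finished.
    void wait_all() {
        std::unique_lock<std::mutex> lock(m_);
        done_cv_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                work_cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
            std::lock_guard<std::mutex> lock(m_);
            if (--pending_ == 0) done_cv_.notify_all();  // signal "all done"
        }
    }
    std::mutex m_;
    std::condition_variable work_cv_, done_cv_;
    std::queue<std::function<void()>> tasks_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> workers_;  // declared last: started after the rest
};

// Hypothetical usage: sum 1..100 on four worker threads.
int demo_pool_sum() {
    std::atomic<int> sum(0);
    ThreadPool pool(4);
    for (int i = 1; i <= 100; ++i)
        pool.enqueue([&sum, i] { sum += i; });
    pool.wait_all();
    return sum.load();
}
```

Only four threads ever run, no matter how many tasks are queued, which is the property the answers above are after.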
Thread pool implementations abound on Stack Overflow and elsewhere.
I find using C++11 threading primitives directly in client code is questionable; using them to write little helpers is a better idea.
Sorry if this has been asked, I did my best to search for a dup before asking...
I am implementing a video processing application that is supposed to run real time. The processing it does can be easily divided into 4 stages, each one operating on the intermediate values generated from the previous stage. The processing has become heavier than what can be processed in 1/30th of a second, but if I can split this application in 4 threads and turn it into a pipeline, each stage takes less than that and the whole thing would run realtime (with a 4 frame lag, which is completely acceptable).
I'm fairly new to multithreaded programming, and the problem I'm having is that I can't find a mechanism to start/stop each thread at the beginning of each frame, so they all march together, delivering one finished frame every "cycle" at the end. All the frameworks/libraries I found seem to worry about load balancing using queues and worker threads, but this is not what I need here. Four threads will do, assuming I can keep them synced.
Can anybody point me to a starting point, using C++?
Thanks.
Assuming a 4 frame lag is acceptable, you could use a pool of list nodes, each with a pointer to a frame buffer and a pointer to the intermediate values (a NULL pointer could be used to indicate the end of a stream). Each thread would have its own list as part of a multithreaded messaging system. The first thread would get a frame node from a free pool, do its processing and send the node to the next thread's list, and so on, with the last thread returning nodes back to the free pool.
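The node-passing scheme can be sketched with standard C++ threading instead of Windows primitives. Everything named here (Frame, Channel, stage, run_pipeline) is hypothetical, and the per-stage "processing" is a stand-in that adds 1 to a value. In this sketch the calling thread collects the finished frames and returns the nodes to the free pool, a small variation on having the last stage do it.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Frame { int id = 0; int value = 0; };  // hypothetical frame node
using FramePtr = std::unique_ptr<Frame>;

// Blocking message list owned by one stage; a null pointer marks end of stream.
class Channel {
public:
    void send(FramePtr f) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(f));
        }
        cv_.notify_one();
    }
    FramePtr receive() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        FramePtr f = std::move(q_.front());
        q_.pop();
        return f;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<FramePtr> q_;
};

// A later stage: receive a node, work on it, pass it on; forward the end marker.
void stage(Channel& in, Channel& out) {
    for (;;) {
        FramePtr f = in.receive();
        if (!f) { out.send(nullptr); return; }
        f->value += 1;  // stand-in for the real per-stage processing
        out.send(std::move(f));
    }
}

// Pushes n_frames through four stages; returns the final values in order.
std::vector<int> run_pipeline(int n_frames) {
    assert(n_frames >= 0);
    Channel free_pool, c1, c2, c3, done;
    for (int i = 0; i < 4; ++i) free_pool.send(FramePtr(new Frame));

    std::thread s1([&] {  // stage 1: take a free node and fill it
        for (int i = 0; i < n_frames; ++i) {
            FramePtr f = free_pool.receive();
            f->id = i;
            f->value = i * 10;
            c1.send(std::move(f));
        }
        c1.send(nullptr);
    });
    std::thread s2(stage, std::ref(c1), std::ref(c2));
    std::thread s3(stage, std::ref(c2), std::ref(c3));
    std::thread s4(stage, std::ref(c3), std::ref(done));

    std::vector<int> results;
    while (FramePtr f = done.receive()) {  // loop ends on the null end marker
        results.push_back(f->value);
        free_pool.send(std::move(f));      // recycle the node
    }
    s1.join(); s2.join(); s3.join(); s4.join();
    return results;
}
```

Because only four nodes exist, the first stage blocks when it gets too far ahead, which bounds the lag without any explicit per-frame start/stop signal.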
Here is a link to an example file copy program that spawns a thread to do the writes. It uses Windows threading, mutexes, and semaphores in the messaging functions, but the messaging functions are simple and could be changed internally to use generic equivalents without changing their interface. The main() function could be changed to use generic threading and setup of the mutexes and semaphores or something equivalent.
mtcopy.zip
I'm sorry, I don't seem to get Intel's TBB. It seems great and well supported, but I can't wrap my head around how to use it, I guess because I'm not used to thinking of parallelism in terms of tasks and instead see it in terms of threads.
My current workload has a job that sends work to a queue to keep processing (think of recursion, but instead of calling itself it sends work to a queue). The way I got this working in Java was to create a concurrent (non-blocking) queue and a ThreadPoolExecutor that worked the queue and sent work back to it. But now I'm trying to do something similar in C++. I found that TBB can create pools, but its approach is very different (Java threads seem to just keep working as long as there is work in the queue, but TBB seems to break the task down at the beginning).
Here's a simple Java example of what I do (before this I set how many threads I want, etc.):
static class DoWork implements Callable<Void> {
    // queue with contexts to process
    private Queue<Context> contexts;
    DoWork(Context request) {
        contexts = new ArrayDeque<Context>();
        contexts.add(request);
    }
    public Void call() {
        while (!contexts.isEmpty()) {
            Context context = contexts.poll(); // take the next context off the queue
            // do work
            contexts.add(new Context(data)); // if it needs to be sent back to the queue to do more work
        }
        return null;
    }
}
I'm sure it's possible to do this in TBB, but I'm just not sure how, because it seems to break up my work at the time I send it. So if there are 2 items in the queue it may only launch 2 threads but won't grow as more work comes in (even if I have 8 cores).
Can someone help me understand how to achieve my task, and also maybe suggest a better way to think about TBB coming from Java's threading environment? (Also, I have no allegiance to TBB, so if there's something easier/better then I'm happy to learn it. I just don't like c++ threadpool because it doesn't seem to be actively developed.)
The approach based on having a queue of items for parallel processing, where each thread just pops one item from the queue and proceeds (and possibly adds a new item to the end of the queue at some point), is fundamentally wrong since it limits the parallelism of the application. The queue becomes a single point of synchronization, and threads need to wait in order to get access to the next item to process. In practice this approach works when tasks (each item's processing job) are quite large and take different times to complete, which leaves the queue less contended than when (most of) the threads finish at the same time and come to the queue for their next items to process.
If you're writing a somewhat reusable piece of code, you cannot guarantee that tasks are either large enough or that they vary in size (time to execute).
I assume that your application scales, which means that you start with some significant number of items (much larger than the number of threads) in your queue, and while threads do the processing they add enough tasks to the end, so that there's enough work for everyone until the application finishes.
If that's the case, I would rather suggest that you keep two thread-safe vectors of your items (TBB's concurrent_vector, for instance) and alternate between them. You start with one vector (your initial set of items) and you enqueue() a task (I think it's described somewhere in chapter 12 of the TBB reference manual) which executes a parallel_for over the initial vector of items. While the first batch is being processed, you push_back the new items onto the second concurrent_vector, and when you're done with the first one you enqueue() a task with a parallel_for over the second vector and start pushing new items back into the first one. You can try to overlap parallel processing of items better by having three vectors instead of two and gradually moving between them while there's still enough work to keep all threads busy.
What you're trying to do is exactly the sort of thing TBB's parallel_do is designed for. The "body" invoked by parallel_do is passed a "feeder" argument, and while processing a task you can call feeder.add(...some new task...) on it to create new tasks for execution before the current parallel_do completes.
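To make the feeder idea concrete without depending on TBB, here is a standard-library-only sketch of the same semantics: a body processes one item and may add more, and the call returns only when nothing is queued or in flight. This is not TBB's API or implementation. The names process_with_feeder and demo_feeder are invented, and it uses a single locked queue, so it illustrates the behaviour rather than TBB's scalability.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

using AddFn = std::function<void(int)>;

// Processes `initial` on n_threads workers. body(item, add) may call
// add(new_item) to feed more work. Returns once no work is queued or running.
void process_with_feeder(const std::vector<int>& initial, unsigned n_threads,
                         const std::function<void(int, const AddFn&)>& body) {
    assert(n_threads > 0);
    std::mutex m;
    std::condition_variable cv;
    std::deque<int> queue(initial.begin(), initial.end());
    std::size_t in_flight = 0;  // items currently being processed

    AddFn add = [&](int item) {
        {
            std::lock_guard<std::mutex> lock(m);
            queue.push_back(item);
        }
        cv.notify_one();
    };
    auto worker = [&] {
        std::unique_lock<std::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return !queue.empty() || in_flight == 0; });
            if (queue.empty()) return;  // nothing queued and nothing can arrive
            const int item = queue.front();
            queue.pop_front();
            ++in_flight;
            lock.unlock();
            body(item, add);            // may call add() any number of times
            lock.lock();
            if (--in_flight == 0 && queue.empty()) cv.notify_all();
        }
    };
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n_threads; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
}

// Hypothetical demo: each item n > 0 feeds n - 1 back in.
// Starting from {3, 2} the items handled are 3,2,1,0 and 2,1,0.
int demo_feeder() {
    std::atomic<int> processed(0);
    process_with_feeder({3, 2}, 4, [&](int n, const AddFn& add) {
        ++processed;
        if (n > 0) add(n - 1);
    });
    return processed.load();
}
```

Workers that find the queue empty stay parked while any item is in flight, so work added late can still be picked up by all threads.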
I have a multi-threaded application that is using pthreads. I have a mutex lock and condition variables. There are two threads: one thread is producing data for the second thread, a worker, which is trying to process the produced data in a real-time fashion such that one chunk is processed as close to the elapsing of a fixed time period as possible.
This works pretty well, however, occasionally when the producer thread releases the condition upon which the worker is waiting, a delay of up to almost a whole second is seen before the worker thread gets control and executes again.
I know this because right before the producer releases the condition upon which the worker is waiting, it does a chunk of processing for the worker if it is time to process another chunk. Then, immediately upon receiving the condition in the worker thread, the worker also does a chunk of processing if it is time to process another chunk.
In this latter case, I am seeing that I am often late processing the chunk. I'd like to eliminate this lost efficiency and do what I can to keep the chunks ticking away as close as possible to the desired frequency.
Is there anything I can do to reduce the delay between the release condition from the producer and the detection that that condition is released such that the worker resumes processing? For example, would it help for the producer to call something to force itself to be context switched out?
Bottom line is the worker has to wait each time it asks the producer to create work for itself so that the producer can muck with the worker's data structures before telling the worker it is ready to run in parallel again. This period of exclusive access by the producer is meant to be short, but during this period, I am also checking for real-time work to be done by the producer on behalf of the worker while the producer has exclusive access. Somehow my hand off back to running in parallel again results in significant delay occasionally that I would like to avoid. Please suggest how this might be best accomplished.
I would suggest the following pattern. Generally the same technique could be used, for example, when prebuffering frames in some real-time renderers.
First, it's obvious that the approach you describe in your message would only be effective if both of your threads are loaded equally (or almost equally) all the time. If not, multi-threading would not actually benefit you in your situation.
Now, let's think about a thread pattern that would be optimal for your problem. Assume we have a yielding thread and a processing thread. The first of them prepares chunks of data to process; the second does the processing and stores the result somewhere (where is not actually important).
The effective way to make these threads work together is a proper yielding mechanism. Your yielding thread should simply add data to some shared buffer and shouldn't actually care about what happens to that data. Your buffer could be implemented as a simple FIFO queue. This means that your yielding thread should prepare data to process and make a PUSH call to your queue:
X = PREPARE_DATA()
BUFFER.LOCK()
BUFFER.PUSH(X)
BUFFER.UNLOCK()
Now, the processing thread. Its behaviour should be described this way (you should probably add some artificial delay like SLEEP(X) between calls to EMPTY):
IF !EMPTY(BUFFER) PROCESS(BUFFER.TOP)
The important point here is what your processing thread should do with processed data. The obvious approach is to make a POP call after the data is processed, but you will probably want to come up with some better idea. Anyway, in my variant this would look like
// After data is processed
BUFFER.LOCK()
BUFFER.POP()
BUFFER.UNLOCK()
Note that locking operations in yielding and processing threads shouldn't actually impact your performance because they are only called once per chunk of data.
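The pseudocode above translates into C++ roughly as follows. This is a sketch under a few assumptions: the Buffer class and run_producer_consumer are invented names, chunks are plain ints, and "processing" just sums them. Unlike the pseudocode, the EMPTY and TOP calls also take the lock here, because reading a std::queue while another thread modifies it is a data race in C++.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <queue>
#include <thread>

// Shared FIFO buffer; every operation takes the lock briefly.
class Buffer {
public:
    void push(int x) { std::lock_guard<std::mutex> l(m_); q_.push(x); }
    bool empty() const { std::lock_guard<std::mutex> l(m_); return q_.empty(); }
    int top() const { std::lock_guard<std::mutex> l(m_); return q_.front(); }
    void pop() { std::lock_guard<std::mutex> l(m_); q_.pop(); }
private:
    mutable std::mutex m_;
    std::queue<int> q_;
};

// The yielding thread pushes chunks 1..n_chunks; the processing thread
// sums them. Returns the sum once every chunk has been handled.
long run_producer_consumer(int n_chunks) {
    assert(n_chunks >= 0);
    Buffer buffer;
    long sum = 0;  // written only by the processing thread, read after join

    std::thread yielding([&] {
        for (int x = 1; x <= n_chunks; ++x)
            buffer.push(x);  // X = PREPARE_DATA(); BUFFER.PUSH(X)
    });
    std::thread processing([&] {
        int handled = 0;
        while (handled < n_chunks) {
            if (buffer.empty()) {
                // Artificial delay between calls to EMPTY.
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
                continue;
            }
            sum += buffer.top();  // PROCESS(BUFFER.TOP)
            buffer.pop();         // POP only after the data is processed
            ++handled;
        }
    });
    yielding.join();
    processing.join();
    return sum;
}
```

With a single processing thread, calling top() after a successful empty() check is safe because nobody else removes items. Replacing the sleep with a condition variable wait would likely cut the wake-up latency, at the cost of slightly more code.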
Now, the interesting part. As I wrote at the beginning, this approach would only be effective if the threads behave roughly the same in terms of CPU/resource usage. There is a way to make this threading solution effective even if that condition does not always hold and depends on other runtime conditions.
This means creating another thread, called the controller thread. This thread would merely compare the time that each thread uses to process one chunk of data and balance the thread priorities accordingly. Actually, we don't have to "compare the time"; the controller thread could simply work like this:
IF BUFFER.SIZE() > T
    DECREASE_PRIORITY(YIELDING_THREAD)
    INCREASE_PRIORITY(PROCESSING_THREAD)
Of course, you could implement some better heuristics here, but the approach with the controller thread should be clear.