Let's say I have a Writer class that generates some data, and a Reader class that consumes it. I want them to run all the time under different threads. How can I do that with OpenMP?
This is what I would like to have:
class Reader
{
public:
    void run();
};
class Writer
{
public:
    void run();
};
int main()
{
    Reader reader;
    Writer writer;
    reader.run(); // starts asynchronously
    writer.run(); // starts asynchronously
    wait_until_finished();
}
I guess the first answers will point to separating each operation into a section, but sections do not guarantee that the code blocks will be given to different threads.
Can tasks do it? As far as I understand from reading about tasks, each code block is executed just once, but the assigned thread can change.
Any other solution?
I would like to know this in order to decide whether some code I have inherited that uses pthreads, which explicitly creates several threads, could be rewritten with OpenMP. The issue is that some threads were not smartly written and contain active-waiting loops. In that situation, if two objects with active waiting are assigned to the same OpenMP thread (and hence are executed sequentially), they can reach a deadlock. At least, I think that could happen with sections, but I am not sure about tasks.
Serialisation could also happen with tasks. One horrible solution would be to reimplement sections on your own with a guarantee that each section runs in a separate thread:
#pragma omp parallel num_threads(3)
{
    switch (omp_get_thread_num())
    {
    case 0: wait_until_finished(); break;
    case 1: reader.run(); break;
    case 2: writer.run(); break;
    }
}
This code assumes that you would like wait_until_finished() to execute in parallel with reader.run() and writer.run(). This is necessary since in OpenMP only the scope of the parallel construct is where the program executes in parallel and there is no way to put things in the background, so to say.
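For comparison, here is roughly what the task-based variant the question asks about could look like (using the question's reader and writer objects). Nothing in the OpenMP specification prevents both tasks from ending up on the same thread, which is exactly the serialisation risk mentioned above:
#pragma omp parallel num_threads(2)
#pragma omp single nowait
{
    #pragma omp task   // may run on either thread of the team
    reader.run();
    #pragma omp task   // possibly on the same thread as the previous task
    writer.run();
    // the implicit barrier at the end of the parallel region waits for both tasks
}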
If you're rewriting the code anyway, you might be better off moving to Threading Building Blocks (TBB; http://www.threadingbuildingblocks.org).
TBB has explicit support for pipeline-style operation (or more complicated task graphs), while maintaining cache locality and independence of the underlying number of threads.
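For illustration, a minimal sketch of a two-stage TBB pipeline; produce(), consume() and done() are placeholders for the Writer/Reader work, not TBB names. This uses the classic tbb::parallel_pipeline API; recent oneTBB releases spell the mode enum tbb::filter_mode instead of tbb::filter:
#include <tbb/pipeline.h>

int produce();            // assumed: generates the next datum
void consume(int value);  // assumed: processes it
bool done();              // assumed: signals end of input

void run_pipeline()
{
    tbb::parallel_pipeline(
        4, // maximum number of items in flight
        tbb::make_filter<void, int>(tbb::filter::serial_in_order,
            [](tbb::flow_control& fc) -> int {
                if (done()) { fc.stop(); return 0; }
                return produce();
            })
      & tbb::make_filter<int, void>(tbb::filter::serial_in_order,
            [](int value) { consume(value); }));
}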
I have a block of code that goes through a loop. A section of the code operates on a vector of data and I would like to parallelize this operation. The idea is to split the elaboration of the array across multiple threads that each work on a subsection of the array. I have to decide between two possibilities. The first one is to create the threads each time this section is encountered and rejoin them with the main thread at the end:
for(....)
{
    //serial stuff
    //create threads
    for(i = 0; i < num_threads; ++i)
    {
        threads_vect.push_back(std::thread(f, sub_array[i]));
    }
    //join them
    for(auto& t : threads_vect)
    {
        t.join();
    }
    threads_vect.clear(); // joined threads cannot be reused
    //serial stuff
}
This is similar to what is done with OpenMP, but since the problem is simple I'd like to use std::thread instead of OpenMP (unless there are good reasons against this).
The second option is to create the threads beforehand to avoid the overhead of creation and destruction, and to use condition variables for synchronization (I have omitted a lot of the synchronization details; this is just the general idea):
std::condition_variable cv_threads;
std::condition_variable cv_main;
//create threads; they will sleep on cv_threads
for(....)
{
    //serial stuff
    //wake up threads
    cv_threads.notify_all();
    //sleep until the last thread finishes, which will notify
    std::unique_lock<std::mutex> main_lock(main_mutex); // main_mutex defined elsewhere
    cv_main.wait(main_lock);
    //serial stuff
}
To allow for parallelism, the threads will have to unlock the thread lock as soon as they wake up at the beginning of the computation, then acquire it again when going back to sleep, and synchronize among themselves to notify the main thread.
My question is which of these solutions is usually preferred in a context like this, and whether the avoided overhead of thread creation and destruction is usually worth the added complexity (or worth it at all, given that the added synchronization also takes time).
Obviously this also depends on how long the computation is for each thread, but it could vary a lot, since the length of the data vector can also be very short (down to about two elements per thread, which would lead to a computation time of about 15 milliseconds).
The biggest disadvantage of creating new threads is that thread creation and shutdown is usually quite expensive. Just think of all the things an operating system has to do to get a thread off the ground, compared to what it takes to notify a condition variable.
Note that synchronization is always required, also with thread creation. The C++11 std::thread, for instance, introduces a synchronizes-with relationship with the creating thread upon construction. So you can safely assume that thread creation will always be significantly more expensive than condition variable signalling, regardless of your implementation.
A framework like OpenMP will typically attempt to amortize these costs in some way. For instance, an OpenMP implementation is not required to shut down the worker threads after every loop and many implementations will not do this.
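To make the comparison concrete, here is a minimal sketch of the second option with the omitted synchronization filled in (all names are mine, not from the question). It shows a single persistent worker; the same idea extends to a pool:
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// One persistent worker, woken through a condition variable instead of
// being created and destroyed on every loop iteration.
class ReusableWorker {
public:
    ReusableWorker() : m_thread([this] { loop(); }) {}
    ~ReusableWorker() {
        { std::lock_guard<std::mutex> lock(m_mutex); m_quit = true; }
        m_cv.notify_one();
        m_thread.join();
    }
    // Hand the worker a job; call wait() before submitting the next one.
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m_mutex); m_job = std::move(job); }
        m_cv.notify_one();
    }
    // Block until the current job has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_done.wait(lock, [this] { return !m_job; });
    }
private:
    void loop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (!m_quit) {
            m_cv.wait(lock, [this] { return m_job || m_quit; });
            if (m_quit) break;
            auto job = m_job;   // m_job stays set to mark "busy"
            lock.unlock();
            job();              // run the work outside the lock
            lock.lock();
            m_job = nullptr;    // mark idle again
            m_done.notify_one();
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_cv;   // wakes the worker
    std::condition_variable m_done; // wakes the waiting main thread
    std::function<void()> m_job;
    bool m_quit = false;
    std::thread m_thread;           // declared last so the members above exist first
};
The loop from the question then becomes: submit the job, do the serial work, and call wait() before the next iteration.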
I have an application which has a couple of processing levels like:
InputStream->Pre-Processing->Computation->OutputStream
Each of these entities runs in a separate thread.
So in my code I have the main thread, which owns the
std::vector<ImageRead> m_readImages;
and then it passes this member variable to each thread:
InputStream input{&m_readImages};
std::thread threadStream{&InputStream::start, &input};
PreProcess pre{&m_readImages};
std::thread preStream{&PreProcess::start, &pre};
...
And each of these classes owns a pointer member to this data:
std::vector<ImageRead>* m_ptrReadImages;
I also have a global mutex defined, which I lock and unlock on each read/write operation to that shared container.
What bothers me is that this mechanism is pretty obscure and sometimes I get confused whether the data is used by another thread or not.
So what is the more straightforward way to share this container between those threads?
The process you described as "Input-->preprocessing-->computation-->Output" is sequential by design: each step depends on the previous one, so parallelizing it in this particular manner is not beneficial, as each thread just has to wait for another to complete. Try to find out which step takes the most time and parallelize that. Or try to set up multiple parallel processing pipelines that operate sequentially on independent, individual data sets. A usual approach for that would employ a processing queue which distributes the tasks among a set of threads.
It would seem to me that your reading and preprocessing could be done independently of the container.
Naively, I would structure this as a fan-out and then fan-in network of tasks.
First, make a dispatch task (a task is a unit of work that is given to a thread to actually execute) that will create the input-and-preprocess tasks.
Use futures as a means for the sub-tasks to communicate back a pointer to the completely loaded image.
Make a second task, the std::vector builder task, that just waits on the futures to get the results as they complete and adds them to the std::vector.
I suggest you structure things this way because I suspect that any IO and preprocessing you are doing will take longer than setting a value in the vector. Using tasks instead of threads directly lets you tune the parallel portion of your work.
I hope that's not too abstracted away from the concrete elements. This is a pattern I find to be well balanced between saturating available hardware, reducing thrash / lock contention, and is understandable by future-you debugging it later.
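A sketch of that fan-out/fan-in structure using std::async and futures (ImageRead's contents, loadAndPreprocess() and the paths vector are placeholders, not taken from the question):
#include <future>
#include <string>
#include <vector>

struct ImageRead { /* pixel data etc. */ };

// Placeholder for the IO + preprocessing work of a single image.
ImageRead loadAndPreprocess(const std::string& path);

std::vector<ImageRead> loadAll(const std::vector<std::string>& paths)
{
    // Fan out: one task per image.
    std::vector<std::future<ImageRead>> pending;
    for (const auto& path : paths)
        pending.push_back(std::async(std::launch::async, loadAndPreprocess, path));

    // Fan in: the "vector builder" collects the results as they complete.
    std::vector<ImageRead> images;
    for (auto& f : pending)
        images.push_back(f.get());
    return images;
}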
I would use 3 separate queues: ready_for_preprocessing, which is fed by InputStream and consumed by Pre-Processing; ready_for_computation, which is fed by Pre-Processing and consumed by Computation; and ready_for_output, which is fed by Computation and consumed by OutputStream.
You'll want each queue to be in a class which has an access mutex (to control adding and removing items from the queue) and an "image available" semaphore (to signal that items are available), as well as the actual queue. This allows multiple instances of each thread. Something like this:
class imageQueue
{
    std::deque<ImageRead> m_readImages;
    std::mutex m_changeQueue;
    Semaphore m_imagesAvailable;
public:
    bool addImage( ImageRead );
    ImageRead getNextImage();
};
addImage() takes the m_changeQueue mutex, adds the image to m_readImages, then signals m_imagesAvailable.
getNextImage() waits on m_imagesAvailable. When it becomes signaled, it takes m_changeQueue, removes the next image from the list, and returns it.
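Semaphore is not a standard C++ type (std::counting_semaphore only arrived in C++20), so here is a sketch of the same class with a condition variable playing the semaphore's role:
#include <condition_variable>
#include <deque>
#include <mutex>

class imageQueue
{
    std::deque<ImageRead> m_readImages;
    std::mutex m_changeQueue;
    std::condition_variable m_imagesAvailable;
public:
    void addImage(ImageRead img)
    {
        {
            std::lock_guard<std::mutex> lock(m_changeQueue);
            m_readImages.push_back(std::move(img));
        }
        m_imagesAvailable.notify_one(); // signal "image available"
    }
    ImageRead getNextImage()
    {
        std::unique_lock<std::mutex> lock(m_changeQueue);
        m_imagesAvailable.wait(lock, [this] { return !m_readImages.empty(); });
        ImageRead img = std::move(m_readImages.front());
        m_readImages.pop_front();
        return img;
    }
};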
cf. http://en.cppreference.com/w/cpp/thread
Ignoring the question of "Should each operation run in an individual thread?", it appears that the objects you want to process move from thread to thread. In effect, they are uniquely owned by only one thread at a time (no thread ever needs to access the data of another thread). There is a way to express just that in C++: std::unique_ptr.
Each step then only works on its owned image. All you have to do is find a thread-safe way to move the ownership of your images through the process steps one by one, which means the critical sections are only at the boundaries between tasks. Since you have multiple of these, abstracting it away would be reasonable:
class ProcessBoundary
{
public:
    void setImage(std::unique_ptr<ImageRead> newImage)
    {
        while (running)
        {
            {
                std::lock_guard<std::mutex> guard(m_mutex);
                if (m_imageToTransfer == nullptr)
                {
                    // The previous image has been taken by the next step,
                    // so we can place this one here.
                    m_imageToTransfer = std::move(newImage);
                    return;
                }
            }
            std::this_thread::yield();
        }
    }
    std::unique_ptr<ImageRead> getImage()
    {
        while (running)
        {
            {
                std::lock_guard<std::mutex> guard(m_mutex);
                if (m_imageToTransfer != nullptr)
                {
                    // An image is waiting here; take over its ownership.
                    return std::move(m_imageToTransfer);
                }
            }
            std::this_thread::yield();
        }
        return nullptr; // the boundary was stopped
    }
    void stop()
    {
        running = false;
    }
private:
    std::mutex m_mutex;
    std::unique_ptr<ImageRead> m_imageToTransfer;
    std::atomic<bool> running; // Set to true in constructor
};
The process steps would then ask for an image with getImage(), which they uniquely own once that function returns. They process it and pass it to the setImage of the next ProcessBoundary.
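For illustration, a pipeline step built on this class could look like the following sketch (preprocessingLoop and preprocess are placeholder names); it relies on getImage() returning nullptr once the boundary has been stopped:
void preprocess(ImageRead&); // placeholder for this step's actual work

void preprocessingLoop(ProcessBoundary& in, ProcessBoundary& out)
{
    // Runs in its own thread; in/out connect it to the neighbouring steps.
    while (std::unique_ptr<ImageRead> image = in.getImage())
    {
        preprocess(*image);             // this thread uniquely owns the image here
        out.setImage(std::move(image));
    }
}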
You could probably improve on this with condition variables, or by adding a queue to this class so that threads can get back to processing the next image sooner. However, if some steps are faster than others, they will eventually be stalled by the slower ones anyway.
This is a design pattern problem. I suggest reading about concurrency design patterns and seeing if there is anything that would help you out.
If you want to add concurrency to the following sequential process:
InputStream->Pre-Processing->Computation->OutputStream
Then I suggest using the active object design pattern. This way each step is not blocked by the previous one and they can run concurrently. It is also very simple to implement (here is an implementation: http://www.drdobbs.com/parallel/prefer-using-active-objects-instead-of-n/225700095).
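In case the link goes stale, here is a minimal sketch of the active object idea (my own sketch, not the article's code): each object owns a private thread and a task queue, and callers merely enqueue work:
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Minimal active object: callers enqueue work, a private thread drains the queue.
class ActiveObject {
public:
    ActiveObject() : m_worker([this] { run(); }) {}
    ~ActiveObject() {
        send([this] { m_stop = true; }); // "poison pill" task
        m_worker.join();
    }
    void send(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_tasks.push(std::move(task));
        }
        m_cv.notify_one();
    }
private:
    void run() {
        while (!m_stop) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [this] { return !m_tasks.empty(); });
                task = std::move(m_tasks.front());
                m_tasks.pop();
            }
            task(); // runs on the private thread, outside the lock
        }
    }
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<std::function<void()>> m_tasks;
    bool m_stop = false;  // only ever touched by the private thread
    std::thread m_worker; // declared last: starts after the members above exist
};
Each pipeline step then becomes an ActiveObject whose tasks hand their results to the send() of the next step.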
As to your question about each thread sharing a DTO: this is easily solved with a wrapper around the DTO. The wrapper contains write and read functions; the write function locks a mutex, and the read function returns const data.
However, I think your problem lies in the design. If the process is sequential as you described, then why is each step sharing the data? The data should be passed to the next step once the current one completes. In other words, the steps should be decoupled.
You are correct in using mutexes and locks. For C++11, this is really the most elegant way of accessing complex data between threads.
Note: This is the first post I have made on this site, but I have searched extensively and was not able to find a solution to my problem.
I have written a program which essentially tests all permutations of a vector of numbers to find an optimal sequence as defined by me. Of course, computing permutations of numbers is very time consuming even for small inputs, so I am trying to speed things up by using multithreading.
Here is a small sample which replicates the problem:
class TaskObject {
public:
    void operator()() {
        recursiveFunc();
    }
private:
    Solution *bestSolution; //Shared by every TaskObject, but can only be accessed by one at a time
    void recursiveFunc() {
        if (base_case) {
            //Only part where shared object is accessed
            //base_case is rarely reached
            return;
        }
        recursiveFunc();
    }
};
void runSolutionWithThreads() {
    vector<thread> threads(std::thread::hardware_concurrency());
    vector<TaskObject> tasks_vector(std::thread::hardware_concurrency());
    updateTasks(); //Sets parameters that initialize the first call to recursiveFunc
    for (int q = 0; q < (int)tasks_vector.size(); ++q) {
        threads[q] = std::thread(tasks_vector[q]);
    }
    for (int i = 0; i < (int)threads.size(); ++i) {
        threads[i].join();
    }
}
I imagined that this would enable all threads to run in parallel, but I can see using the performance profiler in Visual Studio and in the advanced view of Windows Task Manager that only one thread is running at a time. On a system with 4 hardware threads, the CPU is bounded at 25%. I get correct output every time I run, so there are no issues with the algorithm logic. Work is spread out as evenly as possible among all task objects, and collisions with the shared data rarely occur. An implementation of this program using a thread pool always ran at nearly 100%.
The objects submitted to the threads don't print to cout, and all have their own copies of the data required to perform their work, except for one shared object they all reference by pointer.
private:
Solution* bestSolution;
This shared data is not susceptible to a data race since I use std::lock_guard with a std::mutex, so that only one thread can update bestSolution at a time.
In other words, why isn't my CPU running at nearly 100% for my multithreaded program which uses as many threads as there are available in the system?
I can readily update this post with more information if needed.
When debugging your application, use the debugger to "break all" threads. Then examine each thread in the debugger's thread window to see where each one is executing. You will likely find that only one thread is executing code while the rest are all blocked on the mutex held by the one running thread.
If you show a more complete example of the code, it will be much easier to help.
I have written a program in Qt using several threads for doing important stuff in the background. The target for this program is a BeagleBone Black (single-core CPU). But in my tests on my computer (a VM with 4 cores of an i7), the separate threads are already eating up two of the four cores, as seen in htop (maybe because two of them run a while(condition){} loop). How can I prevent these threads from eating up all my resources, so that I can run this multi-threaded program without speed loss on a single-core ARM CPU? And how can I find out which threads are eating up all my CPU resources?
As you're using Qt, there's a better possibility for making your threads wait: you could use QWaitCondition.
This allows you to make a thread block until a certain condition is met in another thread, for example. That other thread can then either notify all waiting threads that the condition has been met and wake them, or just one of them, depending on your need (though with a single QWaitCondition you can't determine or predict which one will be woken; that depends on the OS).
In case you need a more general resource about this topic (idleness), I invite you to read the article In praise of idleness which covers this topic more thoroughly.
Besides using wait conditions, you can also use the event loop to sequence the code.
What needs to happen is that each function of the form:
void Worker::foo(){
    //some code
    while(!stop_condition){}
    //other code
}
void Worker::signalCondition(){
    stop_condition=true;
}
has to be translated to:
void Worker::foo(){
    //some code
}
//actual slot called with a queued connection
void Worker::signalCondition(){
    //other code
}
This means Worker needs to be moved to the other thread, or signalCondition() will be called on the wrong thread.
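A sketch of that wiring (Producer and conditionMet are assumed names for whatever object emits the signal):
QThread workerThread;
Worker worker;
worker.moveToThread(&workerThread); // Worker's slots now run on workerThread
QObject::connect(&workerThread, &QThread::started, &worker, &Worker::foo);
QObject::connect(&producer, &Producer::conditionMet,
                 &worker, &Worker::signalCondition,
                 Qt::QueuedConnection); // delivered through workerThread's event loop
workerThread.start();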
Admittedly, the code change for using QWaitCondition is simpler:
void Worker::foo(){
    //some code
    {
        QMutexLocker locker(&mutex);
        while(!stop_condition){
            waitCondition.wait(&mutex);
        }
    }
    //other code
}
void Worker::signalCondition(){
    QMutexLocker locker(&mutex);
    stop_condition=true;
    waitCondition.wakeOne();
}
I am new to pthreads and would like to ask how to express something like:
while(imhappy())
{
    #pragma omp sections
    {
        #pragma omp section
        {
            dothis();
        }
        #pragma omp section
        {
            dothat();
        }
    }
}
in an equivalent construct using fork() or vfork() ?
Thanks in advance!
PS: I included the while around the sections in case it is somehow more clever to fork before entering the loop due to some resource cloning.
OpenMP does lots of other things behind the scenes than merely spawning threads. It also distributes code segments and synchronises the different threads. You have tagged your question as pthreads, though you are asking about an implementation with fork(), which is confusing. On Linux, fork() is very heavyweight as it creates a new process; clone() is used instead to create threads.
Nevertheless, the rough equivalent of an OpenMP sections construct with two threads would be to fork, followed by an if construct: the parent process would execute the dothis() path while the child executes the dothat() path. The return value of fork() differs between the parent and the child process and can be used to make the decision for the branch. The parent process then waits for the child to finish with waitpid(), which is analogous to the implicit barrier synchronisation at the end of the omp sections region.
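A rough sketch of that (dothis(), dothat() and imhappy() are assumed to be defined elsewhere):
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

bool imhappy();
void dothis();
void dothat();

int main()
{
    while (imhappy())
    {
        pid_t pid = fork();
        if (pid == 0) {
            dothat();              // child executes one "section"
            _exit(0);
        } else if (pid > 0) {
            dothis();              // parent executes the other
            waitpid(pid, NULL, 0); // analogous to the implicit barrier
        } else {
            perror("fork");
            return 1;
        }
    }
    return 0;
}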
One caveat though - fork() is implemented using COW (copy-on-write) pages. What it means is that although at the beginning the memory content of the child is equal to the memory content of the parent, any changes made are private - the child will not see what the parent modifies in its own memory and vice versa. Memory has to be explicitly shared between the two using either SysV shared memory primitives or shared file mappings.
You might really want to look into using POSIX threads API instead.
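For instance, the same two-way split with pthreads, which shares memory between the threads naturally (a sketch, reusing dothis()/dothat() from above):
#include <pthread.h>

void dothis();
void dothat();

// pthread start routines must have the void* (*)(void*) signature.
static void* dothat_wrapper(void*) { dothat(); return NULL; }

void run_sections_once()
{
    pthread_t child;
    pthread_create(&child, NULL, dothat_wrapper, NULL); // one "section"
    dothis();                                           // the other "section"
    pthread_join(child, NULL);                          // the "barrier"
}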
vfork() is a syscall designed for completely different purpose and is not suitable for process cloning at all.
I do not think that there is a straightforward simulation of sections using fork().
However, you could theoretically simulate it using a message-passing mechanism where the shared variables are kept with a root process. All the OpenMP flushes would then happen through the root. (Remember that OpenMP uses a weaker-than-weak consistency model.)