I was wondering: in an MPI program where you specify that there is thread support, if all the threads make an MPI::Bcast call (ensuring that the sender process only has one thread), is the broadcast received by all the other threads, or just by one thread from each process (the fastest)?
Common MPI implementations deal with communication among processes. Implementations supporting threads simply allow multiple threads to make some or all MPI calls, rather than just one. Every one of T threads in a process calling MPI_Bcast means that the process has called MPI_Bcast T times, and expects that all of the other ranks on the communicator will do the same.
Depending on the level of thread support in your MPI implementation (please check; threading support in MPI is very sketchy), an MPI call is made only once per process.
To add to the answer given by Novelocrat:
The basic unit of computation in MPI is the "rank." In most (all?) interesting implementations of MPI, a rank IS a process. All of the threads within a process share the same Rank ID.
The MPI Standard supports multiple levels of thread parallelism: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE.
Of these, only MPI_THREAD_MULTIPLE actually has multiple threads making overlapping calls into the MPI library. The other three cases are an assertion from the application that the rank can be treated as if it were "single threaded." For more, see the MPI Standard entry on MPI_INIT_THREAD.
Related
I am wondering whether there is any difference in performance between running a single program (exe) with 10 different threads and running the program with a single thread 10 times in parallel (starting it from a .bat file), assuming the work done is the same and only the number of threads spawned by the program changes.
I am developing a client/server communication program and want to test it for throughput. I'm currently learning about parallel programming and threading, and wasn't sure how Windows would handle the above scenario. Will the scheduler schedule work the same way for both scenarios? Will there be a performance difference?
The machine the program is running on has 4 hardware threads.
Threads are slightly lighter weight than processes, since there are many things a process gets its own copy of. This is especially true when you compare the time it takes to start a new thread vs. starting a new process from scratch (fork, where available, also avoids a lot of those costs). In either case, though, you can generally get even better performance by using a worker pool where possible, rather than starting and stopping fresh processes/threads.
The other major difference is that threads by default all share the same memory while processes get their own and need to communicate through more explicit means (which may include blocks of shared memory). This might make it easier for a threaded solution to avoid copying data, but this is also one of the dangers of multithreaded programming when care is not taken in how they use the shared memory/objects.
Also, there may be APIs that are more oriented around a single process. For example, on Windows there are I/O Completion Ports, which basically work on the idea of having many in-progress IO operations for different files, sockets, etc., with multiple threads (but generally far fewer than the number of files/sockets) handling the results as they become available through a GetQueuedCompletionStatus loop.
In TBB, the fact that the task_scheduler_init() method is often (and should be?) invoked internally is a deliberate design decision.
However, if we mix TBB and MPI, is it guaranteed to be thread-safe without controlling the number of threads of each MPI process?
For example, say we have 7 cores (with no hyper-threading) and 2 MPI processes. If each process spawns an individual TBB task using 4 threads simultaneously, then there is a conflict which might cause the program to crash at runtime.
I'm a newbie of TBB.
Looking forward to your comments and suggestions!
From the Intel TBB runtime's perspective, it does not matter whether it is an MPI process or not. So if you have two processes, you will have two independent instances of the Intel TBB runtime. I am not sure that I understand the question related to thread-safety, but it should work correctly. However, you may get oversubscription, which can lead to performance issues.
In addition, you should check your MPI implementation's documentation if you use MPI routines concurrently (from multiple threads), because that can cause some issues.
Generally speaking, this is a two-step tango:
MPI binds each task to some resources;
the thread runtime (TBB; the same applies to OpenMP) is generally smart enough to bind threads within the previously provided resources.
Bottom line: if MPI binds its tasks to non-overlapping resources, then there should be no conflict caused by the TBB runtime.
A typical scenario is to run 2 MPI tasks with 8 OpenMP threads per task on a dual-socket octo-core box. As long as MPI binds a task to a socket, and the OpenMP runtime is told to bind the threads to cores, performance will be optimal.
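With Open MPI, that scenario could be launched along these lines; flags differ between MPI implementations (and these binding options are Open MPI specific), so treat this as an illustrative sketch rather than a recipe:

```shell
# 2 ranks, each mapped to a socket with 8 processing elements reserved
export OMP_NUM_THREADS=8      # 8 OpenMP threads per MPI task
export OMP_PROC_BIND=close    # keep each task's threads on its own cores
mpirun -np 2 --map-by socket:PE=8 --bind-to core ./app
```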
I am using two Boost threads, each of which uses a different FFTW plan (example: thread 1 uses 'plan_fft' and thread 2 uses 'plan_ifft'). When I run only one thread (thread 2), it works perfectly, but when I run both threads, I get a segmentation fault. I think it may be because plan creation is not thread-safe. It would be a great help if someone could explain how to use two different fftw_plans (one per thread) in two threads in parallel.
I forgot to mention the solutions suggested by the FFTW developers for multithreading:
using semaphore locks
creating all the plans in the one thread
I implemented the 2nd one (i.e., created all the plans in the main program and then called the two threads from the main program). When I do so, there are no errors and no segmentation fault, but I am not getting the result.
Please note: these two threads are independent and not sharing any common data, so I think a semaphore lock won't work for my case.
My doubt: can we create (and destroy) plans in the main program and execute these two different plans in two different threads?
The FFTW folks provide a nice summary of the thread-safety topic here. Wrap-up: nothing is thread-safe except fftw_execute, so you have to take care that, e.g., only a single thread creates plans. However, it should be no problem to execute them in parallel.
This question already has an answer here:
Does a call to MPI_Barrier affect every thread in an MPI process?
Let's say I have 2 processes, each with two threads (1 IO thread, 1 compute thread).
I am interested in using some IO library (adios).
I am asking myself what would happen if I coded something like this:
The IO threads in the 2 processes do some IO, and they use MPI_Barrier(MPI_COMM_WORLD) at some point B to synchronize the IO.
The compute threads in the two processes also use MPI_Barrier(MPI_COMM_WORLD) at some point A to synchronize the computation (while the IO threads are working).
I don't know exactly what might happen. Is the following case possible?
Process 1, IO Thread waits at B
Process 2, Compute thread waits at A
=> Process 1 and Process 2 get synchronized with each other (so Process 1 leaves the barrier at B and Process 2 leaves it at A; the two processes do not even synchronize at the same point!)
If that can happen, isn't this unwanted behavior that the programmer did not intend? Can it be avoided by using two different communicators with an identical set of processes (MPI_Comm_dup(...))?
Or is the barrier really dependent on the code line? If so, how would that be implemented?
This is confusing!
Thanks a lot!
The first scenario is very likely to happen (barrier calls from different threads matching each other). From MPI's point of view a barrier must be entered by all ranks inside the communicator, no matter from which thread comes the barrier call and at which code line the call is. MPI still has no notion of thread identity and all threads are treated together as a single entity - a rank. The only special treatment is that when the MPI_THREAD_MULTIPLE thread support level is being provided, the library should implement proper locks so that MPI calls could be made from any thread and at any time.
That's why it is highly advisable that parallel library authors should always duplicate the world communicator and use the duplicate for internal communication needs. That way the library code won't interfere with the user code (with some special exceptions that could result in deadlocks).
In my C++ program, I use a proprietary library (.dll) which is not thread-safe. :-(
This library performs a specific scientific computation.
Is there a way to safely run several computations from this library in parallel with threads?
(1 process, many threads)
My program "is like" a "for" loop which calls the computation from my non-thread-safe library on each iteration.
Sounds like you want to load the DLL multiple times. Take a look at Load the same dll multiple times.
A very simple approach would be forking multiple slave processes in your for loop. Each slave process loads the non-thread-safe module and does the computation, and finally returns the result to the parent process via either a simple return code (if the result fits into 4 bytes), IPC, or a file.
Of course, this approach assumes that the parallel computations do not need any interaction with the others.