Hybrid Parallelism: MPI and TBB

Hybrid Parallelism: MPI and TBB - c++

In TBB, the task_scheduler_init () method, which is often (and should be?) invoked internally, is a deliberate design decision.
However, if we mix TBB and MPI, is it guaranteed to be thread-safe without controlling the number of threads of each MPI process?
For example, say we have 7 cores (with no hyper-threading) and 2 MPI processes. If each process spawns an individual TBB task using 4 threads simultaneously, then there is a conflict which might cause the program to crash at runtime.
I'm a newbie of TBB.
Looking forward to your comments and suggestions!

From Intel TBB runtime perspective it does not matter if it is a MPI process or not. So if you have two processes you will have two independent instances of Intel TBB runtime. I am not sure that I understand the question related to thread-safeness but it should work correctly. However, you may have oversubscription that can lead to performance issues.
In addition, you should check the MPI implementation documentation if you use MPI routines concurrently (from multiple threads) because it can cause some issues.

Generally speaking, this is a two steps tango
MPI binds each task to some resources
the thread runtime (TBB, same things would apply to OpenMP) is generally smart enough to bind threads within the previously provided resources.
bottom line, if MPI binds its tasks to non overlapping resources, then there should be no conflict caused by the TBB runtime.
a typical scenario is to run 2 MPI tasks with 8 OpenMP threads per task on a dual socket octo core box. as long as MPI binds a task to a socket, and the OpenMP runtime is told to bind the threads to cores, performance will be optimal.

Related

10 threads in a single program or 1 thread program ran 10 times (C++)?

I am wondering whether there is any difference in performance in running a single program (exe) with 10 different threads or running the program with a single thread 10 times in parallel (starting it from a .bat file) assuming the work done is the same and only the number of threads spawned by the program change?
I am developing a client/server communication program and want to test it for throughput. I'm currently learning about parallel programming and threading as wasn't sure how Windows would handle the above scenario. Will the scheduler schedule work the same way for both scenarios? Will there be a performance difference?
The machine the program is running on has 4 threads.

Threads are slightly lighter weight than processes as there are many things a process gets it's own copy of. Especially when you compare the time it takes to start a new thread, vs starting a new process (from scratch, fork where available also avoids a lot of costs). Although in either case you can generally get even better performance using a worker pool where possible rather than starting and stopping fresh processes/threads.
The other major difference is that threads by default all share the same memory while processes get their own and need to communicate through more explicit means (which may include blocks of shared memory). This might make it easier for a threaded solution to avoid copying data, but this is also one of the dangers of multithreaded programming when care is not taken in how they use the shared memory/objects.
Also there may be API's that are more orientated around a single process. For example on Windows there is IO Completion Ports which basically works on the idea of having many in-progress IO operations for different files, sockets, etc. with multiple threads (but generally far less than the number of files/sockets) handling the results as they become available through a GetQueuedCompletionStatus loop.

Boost thread, Posix thread and STD thread, why do they offer different performances?

As Far as I know,
In computer science, a thread of execution is the smallest sequence of
programmed instructions that can be managed independently by a
scheduler, which is typically a part of the operating system. The
implementation of threads and processes differs between operating
systems, but in most cases a thread is a component of a process.
Multiple threads can exist within one process, executing concurrently
and sharing resources such as memory, while different processes do not
share these resources. In particular, the threads of a process share
its executable code and the values of its variables at any given time.[1]
When I decided to write a multi thread program in c++, i faced with many choices like boost thread, posix thread and std thread.
A simple search on internet shows a performance measurement taken by boost.org website here.
My question is a bit more basic and performance related as well.
Basically, Why do they differ in performance ? Why, for example thread type A, is faster than the others? The are written by most professional programmers, are ran by powerful OSs ,yet they offer different performance.
What makes them faster or slower?

The Boost documentation refers to the Fiber library, which are not actually threads. Creating what the library calls a fiber (essentially a user-space thread or coroutine, sometimes also referred to as green threads) does not create a separate schedulable entity on the kernel side, so it can be much more efficient at creation time. Other things could be less efficient because I/O operations necessarily become much more involved under this model (because a fiber doing I/O should not block the operating system thread it runs on if other fibers could do work there).
Note that some of the coroutine implementations out there are well out of the conceptual limits of the de-facto GNU/Linux ABI and other POSIX-like operating systems, so they should be considered ugly hacks at best.

Using multi core support in C / C++

I have seen in some posts it has been said that to use multiple cores of processor use Boost thread (use multi-threading) library. Usually threads are not visible to operating system. So how can we sure that multi-threading will support usage of multi-cores. Is there a difference between Java threads and Boost threads?

The operating system is also called a "supervisor" because it has access to everything. Since it is responsible for managing preemptive threads, it knows exactly how many a process has, and can inspect what they are doing at any time.
Java may add a layer of indirection (green threads) to make many threads look like one, depending on JVM and configuration. Boost does not do this, but instead only wraps the POSIX interface which usually communicates directly with the OS kernel.
Massively multithreaded applications may benefit from coalescing threads, so that the number of ready-to-run threads matches the number of logical CPU cores. Reducing everything to one thread may be going too far, though :v) and #Voo says that green threads are only a legacy technology. A good JVM should support true multithreading; check your configuration options. On the C++ side, there are libraries like Intel TBB and Apple GCD to help manage parallelism.

Multithreading vs multiprocessing

I am new to this kind of programming and need your point of view.
I have to build an application but I can't get it to compute fast enough. I have already tried Intel TBB, and it is easy to use, but I have never used other libraries.
In multiprocessor programming, I am reading about OpenMP and Boost for the multithreading, but I don't know their pros and cons.
In C++, when is multi threaded programming advantageous compared to multiprocessor programming and vice versa?Which is best suited to heavy computations or launching many tasks...? What are their pros and cons when we build an application designed with them? And finally, which library is best to work with?

Multithreading means exactly that, running multiple threads. This can be done on a uni-processor system, or on a multi-processor system.
On a single-processor system, when running multiple threads, the actual observation of the computer doing multiple things at the same time (i.e., multi-tasking) is an illusion, because what's really happening under the hood is that there is a software scheduler performing time-slicing on the single CPU. So only a single task is happening at any given time, but the scheduler is switching between tasks fast enough so that you never notice that there are multiple processes, threads, etc., contending for the same CPU resource.
On a multi-processor system, the need for time-slicing is reduced. The time-slicing effect is still there, because a modern OS could have hundred's of threads contending for two or more processors, and there is typically never a 1-to-1 relationship in the number of threads to the number of processing cores available. So at some point, a thread will have to stop and another thread starts on a CPU that the two threads are sharing. This is again handled by the OS's scheduler. That being said, with a multiprocessors system, you can have two things happening at the same time, unlike with the uni-processor system.
In the end, the two paradigms are really somewhat orthogonal in the sense that you will need multithreading whenever you want to have two or more tasks running asynchronously, but because of time-slicing, you do not necessarily need a multi-processor system to accomplish that. If you are trying to run multiple threads, and are doing a task that is highly parallel (i.e., trying to solve an integral), then yes, the more cores you can throw at a problem, the better. You won't necessarily need a 1-to-1 relationship between threads and processing cores, but at the same time, you don't want to spin off so many threads that you end up with tons of idle threads because they must wait to be scheduled on one of the available CPU cores. On the other hand, if your parallel tasks requires some sequential component, i.e., a thread will be waiting for the result from another thread before it can continue, then you may be able to run more threads with some type of barrier or synchronization method so that the threads that need to be idle are not spinning away using CPU time, and only the threads that need to run are contending for CPU resources.

There are a few important points that I believe should be added to the excellent answer by #Jason.
First, multithreading is not always an illusion even on a single processor - there are operations that do not involve the processor. These are mainly I/O - disk, network, terminal etc. The basic form for such operation is blocking or synchronous, i.e. your program waits until the operation is completed and then proceeds. While waiting, the CPU is switched to another process/thread.
if you have anything you can do during that time (e.g. background computation while waiting for user input, serving another request etc.) you have basically two options:
use asynchronous I/O: you call a non-blocking I/O providing it with a callback function, telling it "call this function when you are done". The call returns immediately and the I/O operation continues in the background. You go on with the other stuff.
use multithreading: you have a dedicated thread for each kind of task. While one waits for the blocking I/O call, the other goes on.
Both approaches are difficult programming paradigms, each has its pros and cons.
with async I/O the logic of the program's logic is less obvious and is difficult to follow and debug. However you avoid thread-safety issues.
with threads, the challange is to write thread-safe programs. Thread safety faults are nasty bugs that are quite difficult to reproduce. Over-use of locking can actually lead to degrading instead of improving the performance.
(coming to the multi-processing)
Multithreading made popular on Windows because manipulating processes is quite heavy on Windows (creating a process, context-switching etc.) as opposed to threads which are much more lightweight (at least this was the case when I worked on Win2K).
On Linux/Unix, processes are much more lightweight. Also (AFAIK) threads on Linux are implemented actually as a kind of processes internally, so there is no gain in context-switching of threads vs. processes. However, you need to use some form of IPC (inter-process communications), as shared memory, pipes, message queue etc.
On a more lite note, look at the SQLite FAQ, which declares "Threads are evil"! :)

To answer the first question:
The best approach is to just use multithreading techniques in your code until you get to the point where even that doesn't give you enough benefit. Assume the OS will handle delegation to multiple processors if they're available.
If you actually are working on a problem where multithreading isn't enough, even with multiple processors (or if you're running on an OS that isn't using its multiple processors), then you can worry about discovering how to get more power. Which might mean spawning processes across a network to other machines.
I haven't used TBB, but I have used IPP and found it to be efficient and well-designed. Boost is portable.

Just wanted to mention that the Flow-Based Programming ( http://www.jpaulmorrison.com/fbp ) paradigm is a naturally multiprogramming/multiprocessing approach to application development. It provides a consistent application view from high level to low level. The Java and C# implementations take advantage of all the processors on your machine, but the older C++ implementation only uses one processor. However, it could fairly easily be extended to use BOOST (or pthreads, I assume) by the use of locking on connections. I had started converting it to use fibers, but I'm not sure if there's any point in continuing on this route. :-) Feedback would be appreciated. BTW The Java and C# implementations can even intercommunicate using sockets.

Handling GUI thread in a program using OpenMP

I have a C++ program that performs some lengthy computation in parallel using OpenMP. Now that program also has to respond to user input and update some graphics. So far I've been starting my computations from the main / GUI thread, carefully balancing the workload so that it is neither to short to mask the OpenMP threading overhead nor to long so the GUI becomes unresponsive.
Clearly I'd like to fix that by running everything concurrently. As far as I can tell, OpenMP 2.5 doesn't provide a good mechanism for doing this. I assume it wasn't intended for this type of problem. I also wouldn't want to dedicate an entire core to the GUI thread, it just needs <10% of one for its work.
I thought maybe separating the computation into a separate pthread which launches the parallel constructs would be a good way of solving this. I coded this up but had OpenMP crash when invoked from the pthread, similar to this bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36242 . Note that I was not trying to launch parallel constructs from more than one thread at a time, OpenMP was only used in one pthread throughout the program.
It seems I can neither use OpenMP to schedule my GUI work concurrently nor use pthreads to have the parallel constructs run concurrently. I was thinking of just handling my GUI work in a separate thread, but that happens to be rather ugly in my case and might actually not work due to various libraries I use.
What's the textbook solution here? I'm sure others have used OpenMP in a program that needs to concurrently deal with a GUI / networking etc., but I haven't been able to find any information using Google or the OpenMP forum.
Thanks!

There is no textbook solution. The textbook application for OpenMP is non-interactive programs that read input files, do heavy computation, and write output files, all using the same thread pool of size ~ #CPUs in your supercomputer. It was not designed for concurrent execution of interactive and computation code and I don't think interop with any threads library is guaranteed by the spec.
Leaving theory aside, you seem to have encountered a bug in the GCC implementation of OpenMP. Please file a bug report with the GCC maintainers and for the time being, either look for a different compiler or run your GUI code in a separate process, communicating with the OpenMP program over some IPC mechanism. (E.g., async I/O over sockets.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js