pthread vs intel TBB and their relation to OpenMP?

pthread vs intel TBB and their relation to OpenMP? - c++

For multi-thread programming, with the considerations of combinations with HPC application (MPI), which one is better, can we say, in terms of functionality, Intel TBB (thread building block) is comparable to pthread or not? I only get experience in open mp, but I heard both TBB and Pthread offers finer thread control comparing to open mp, but can TBB or TBB+OpenMP offer similiar functionality compared to pthread?

pthread is a thin wrapper above the OS infrastructure. It allows you to create a thread with a given thread main function, and some synchronization primitives (mutexes semaphores etc). Under Linux pthread is implemented on top of the clone(2) system call. The equivilant under Windows is called CreateThread. All the other threading stuff is built on top of this base.
Intel TBB is higher level, it gives parallel_for and parallel_reduce and similiar higher level constructs similar to OpenMP but implemented as a library not a language extension.
OpenMPI is even higher level still with multi-machine distributed infrastructure, but it is very old fashioned and a little clunky.
My advice would be to learn the pthread library first until you completely understand it, and then look at higher level libraries afterward.

TBB allows you to write portable code on top of the native threading functionality, so it makes the code more portable over different OS architectures. I don't think it's "more efficient" than pthread.
I haven't used open MP personally, but in the past I've worked with developers using open MP (as a technical specialist on the processors they were using), and it seems to work reasonably well for certain things, but others are harder to use in open mp than writing your own code. It all depends on what exactly you are doing. One of the benefits with openmp of course is that you can always recompile the code without the openmp option, and the code just works directly as you expect it to [but not spread out, of course].
With a programmes threading approach, you can have much more control over exactly what happens on what thread, yes. But it also means a lot more work...

Related

What libraries should I use for better OCaml Threading?

I have asked a related question before Why OCaml's threading is considered as `not enough`?
No matter how "bad" ocaml's threading is, I notice some libraries say they can do real threading.
For example, Lwt
Lwt offers a new alternative. It provides very light-weight
cooperative threads; ``launching'' a thread is a very fast operation,
it does not require a new stack, a new process, or anything else.
Moreover context switches are very fast. In fact, it is so easy that
we will launch a thread for every system call. And composing
cooperative threads will allow us to write highly asynchronous
programs.
Also Jane Street's aync_core also provides similar things, if I am right.
But I am quite confused. Do Lwt or aync_core provide threading like Java threading?
If I use them, can I utilise multiple cpu?
In what way, can I get a "real threading" (just like in Java) in OCaml?
Edit
I am still confused.
Let me add a scenario:
I have a server (16 cpu cores) and a server application.
What the server application does are:
It listens to requests
For each request, it starts a computational task (let's say costs 2 minutes to finish)
When each task finishes, the task will either return the result back to the main or just send the result back to client directly
In Java, it is very easy. I create a thread pool, then for each request, I create a thread in that pool. that thread will run the computational task. This is mature in Java and it can utilize the 16 cpu cores. Am I right?
So my question is: can I do the same thing in OCaml?

The example of parallelized server that you cite is one of those embarassingly parallel problem that are well solved with a simple multiprocessing model, using fork. This has been doable in OCaml for decades, and yes, you will an almost linear speedup using all the cores of your machine if you need.
To do that using the simple primitives of the standard library, see this Chapter of the online book "Unix system programming in OCaml" (first released in 2003), and/or this chapter of the online book "Developing Applications with OCaml" (first released in 2000).
You may also want to use higher-level libraries such as Gerd Stolpmann's OCamlnet library mentioned by rafix, which provides a lot of stuff from direct helper for the usual client/server design, to lower-level multiprocess communication libraries; see the documentation.
The library Parmap is also interesting, but maybe for slightly different use case (it's more that you have a large array of data available all at the same time, that you want to process with the same function in parallel): a drop-in remplacement of Array.map or List.map (or fold) that parallelizes computations.

The closest thing you will find to real (preemptive) threading is the built in threading library. By that mean I mean that your programming model will be the same but with 2 important differences:
OCaml's native threads are not lightweight like Java's.
Only a single thread executes at a time, so you cannot take advantage of multiple processes.
This makes OCaml's threads a pretty bad solution to either concurrency or parallelism so in general people avoid using them. But they still do have their uses.
Lwt and Async are very similar and provide you with a different flavour of threading - a cooperative style. Cooperative threads differ from preemptive ones in the fact context switching between threads is explicit in the code and blocking calls are always apparent from the type signature. The cooperative threads provided are very cheap so very well suited for concurrency but again will not help you with parallelilsm (due to the limitations of OCaml's runtime).
See this for a good introduction to cooperative threading: http://janestreet.github.io/guide-async.html
EDIT: for your particular scenario I would use Parmap, if the tasks are so computationally intensive as in your example then the overhead of starting the processes from parmap should be negligible.

GNU pth vs. pthread

I want to build a portable and efficient server in C++; it will have lots of clients trying to connect at the same time, so it must be able of handling each request parallel.
I have been trying to find documentation, guides... etc. for multithreading. I have found a lot about POSIX Pthread, but almost nothing for GNU Pth (apart from the official manual in gnu.org).
So, can anyone explain me the difference between POSIX Pthread and GNU Pth? Please, I want the response not to be a copy of Wikipedia's contents (keep in mind that I'm an absolute newbie to multithreading). I want my server to be portable and efficient between all *nix-based systems, keeping away of using heavy fork()s.
Thanks for your help.
PS: I think it's better to ask this here: what about Windows? Are Pthreads or Pth an option there? If not, what is the API for that operating system?

Use Pthreads, it's much more widely used, so there is far more information and support available for it. I've never met anyone who actually uses GNU Pth. Or better yet if you are using C++11 use std::thread and if not then use boost::thread.
So, can anyone explain me the difference between POSIX Pthread and GNU Pth?
Pthreads is a cross-platform standard for pre-emptible multithreading, meaning (usually) the OS kernel manages the threads and the OS scheduler decides when each thread gets to run (if you have a single core only one thread can run at a time, if you have multiple cores multiple threads can run at a time). The OS scheduler could pause any thread at (almost) any time and let another thread run, so each thread gets a limited "time slice" and then other threads get to run.
GNU Pth is a non-preemptible user-space threading library, meaning the threads and which ones run at which time are decided in user-space not by the kernel. Some people say programs using non-preemptible threading libraries are easier to understand, because your thread won't get paused at arbitrary times for another thread to run.
I want my server to be portable and efficient between all *nix-based systems, keeping away of using heavy fork()s.
fork is not heavy on UNIX.
what about W*ndows? Are Pthreads or Pth an option there? If not, what is the API for that operating system?
There are pthreads APIs for Windows, but they're not native to the Windows OS. I don't know if GNU Pth works on Windows - I doubt it, unless you use Cygwin. Windows has its own Win32 thread model.
Using std::thread or boost::thread is portable to POSIX platforms and Windows, and makes certain parts of the API easier to use (specifically, locking and unlocking mutexes can be easily done in an exception safe way and condition variables are easier to use.)

Gnu PTH is for a very limited use case: you want to use a multi-threaded implementation paradigm but you don't want to use multiple CPUs or cores and you don't want to rely on any OS or kernel-level support. Since almost all general-purpose CPUs now have multiple cores, this use case is increasingly irrelevant.
Windows has a separate threading model from POSIX; if you want your application to be cross-platform it is best to use a cross-platform threading library such as boost::thread.

I think GNUs PTH is meaned for C in the first place. You can use it on C++ too but C++ have its own anyway.
There are quite some applications using pth like low-level burning tools (and so GUI-Tools like K3B and Brasero depend on pth), also GnuPG uses PTH, the package management of Archlinux and some multimedia stuff.
On Windows its always a bit complicated. Microsoft did never get over the fact that C is the Programming Language from/for UNIX-Systems and so is suffering the NIH Symptome (Not Invented Here)
So they do a lot of stuff without any advantage just to be different.
If you use an Application which should run everywhere and its not low-level, use Qt with its QThreads and QThreadPool
Its 100% the same on all operating systems
You need much less code
If you write an "low-level" application i recommend to split your applications into backends and frontends and write a own backend for each OS and use the library which will do the least problems.

How make a threading mechanism in C++?

I know there are some threading libraries for C++, like Pthread, Boost etc out there, but how are they working? There must be an implementation of the logic somewhere.
Let's say that I would like to write my own threading mechanism in C++, not using any library, how would I start? What should I have in mind when writing it?

You'd directly call the underlying API calls in the operating system. For example, CreateThread. Naturally, this is cumbersome and platform-specific, which is why we like to use portable C++ threading libraries...

In C++98/03, there is no notion of a "thread", so the question cannot be answered within the language. In C++11, the answer is to use <thread>.
On the implementation side, threading is an operating system feature. The operating system already has to schedule multiple processes (i.e. separate programs), and a multi-threading OS adds to that the ability to schedule multiple threads within one process. A the very heart, the OS may or may not take advantage of having physically more than one CPU (though that also applies to simple multi-processing; and conversely you can schedule multiple threads on a single CPU). At the heart of the programming, you will need hardware support for synchronisation primitives like atomic read/writes and atomic compare-and-swap to implement correct memory access. (This is not needed for only multi-processing, because separate processes have distinct memory; although it will be needed by the OS itself if there are multiple physical CPUs in use.)

Well, you need something which is able to run several threads.
If you are working on developing an operating system kernel on the bare metal, I think that current multi-core processors have only one core working after their power-on reset. Even the BIOS on most PCs probably keep only one core working (and the other cores idle). So you'll need to write (assembly, non-portable) code to start other cores.
And (as James reminded you), most of the time you are using some operating system kernel. For instance, on Linux (I don't know about Windows), threads are known by the kernel (because the tasks it is scheduling are threads) and they need to be initiated by the Linux clone(2) system call.
Often, kernel threads are quite heavy, and the system has a library (NPTL for Linux Posix threads) which may use fewer kernel threads than user threads (actually Linux NPTL is a 1:1 mapping between kernel and user threads, but on some other systems, like probably Solaris, things are different).

You can't write your own threading mechanism, unless you mean pseudo-threads like co-routines and not actual concurrently executing threads. This is because the fundamental thread mechanism is defined by the kernel and you can't change it nor implement your own. Any library you write must fall back, eventually, to the operating system.

What should I know about multithreading and when to use it, mainly in c++

I have never come across multithreading but I hear about it everywhere. What should I know about it and when should I use it? I code mainly in c++.

Mostly, you will need to learn about MT libraries on OS on which your application needs to run. Until and unless C++0x becomes a reality (which is a long way as it looks now), there is no support from the language proper or the standard library for threads. I suggest you take a look at the POSIX standard pthreads library for *nix and Windows threads to get started.

This is my opinion, but the biggest issue with multithreading is that it is difficult. I don't mean that from an experienced programmer point of view, I mean it conceptually. There really are a lot of difficult concurrency problems that appear once you dive into parallel programming. This is well known, and there are many approaches taken to make concurrency easier for the application developer. Functional languages have become a lot more popular because of their lack of side effects and idempotency. Some vendors choose to hide the concurrency behind API's (like Apple's Core Animation).
Multitheaded programs can see some huge gains in performance (both in user perception and actual amount of work done), but you do have to spend time to understand the interactions that your code and data structures make.

MSDN Multithreading for Rookies article is probably worth reading. Being from Microsoft, it's written in terms of what Microsoft OSes support(ed in 1993), but most of the basic ideas apply equally to other systems, with suitable renaming of functions and such.

That is a huge subject.
A few points...
With multi-core, the importance of multi-threading is now huge. If you aren't multithreading, you aren't getting the full performance capability of the machine.
Multi-threading is hard. Communicating and synchronization between threads is tricky to get right. Problems are often intermittent, hard to diagnose, and if the design isn't right for multi-threading, hard to fix.
Multi-threading is currently mostly non-portable and platform specific.
There are portable libraries with wrappers around threading APIs. Boost is one. wxWidgets (mainly a GUI library) is another. It can be done reasonably portably, but you won't have all the options you get from platform-specific APIs.

I've got an introduction to multithreading that you might find useful.
In this article there isn't a single
line of code and it's not aimed at
teaching the intricacies of
multithreaded programming in any given
programming language but to give a
short introduction, focusing primarily
on how and especially why and when
multithreaded programming would be
useful.

Here's a link to a good tutorial on POSIX threads programming (with diagrams) to get you started. While this tutorial is pthread specific, many of the concepts transfer to other systems.
To understand more about when to use threads, it helps to have a basic understanding of parallel programming. Here's a link to a tutorial on the very basics of parallel computing intended for those who are just becoming acquainted with the subject.

The other replies covered the how part, I'll briefly mention when to use multithreading.
The main alternative to multithreading is using a timer. Consider for example that you need to update a little label on your form with the existence of a file. If the file exists, you need to draw a special icon or something. Now if you use a timer with a low timeout, you can achieve basically the same thing, a function that polls if the file exists very frequently and updates your ui. No extra hassle.
But your function is doing a lot of unnecessary work, isn't it. The OS provides a "hey this file has been created" primitive that puts your thread to sleep until your file is ready. Obviously you can't use this from the ui thread or your entire application would freeze, so instead you spawn a new thread and set it to wait on the file creation event.
Now your application is using as little cpu as possible because of the fact that threads can wait on events (be it with mutexes or events). Say your file is ready however. You can't update your ui from different threads because all hell would break loose if 2 threads try to change the same bit of memory at the same time. In fact this is so bad that windows flat out rejects your attempts to do it at all.
So now you need either a synchronization mechanism of sorts to communicate with the ui one after the other (serially) so you don't step on eachother's toes, but you can't code the main thread part because the ui loop is hidden deep inside windows.
The other alternative is to use another way to communicate between threads. In this case, you might use PostMessage to post a message to the main ui loop that the file has been found and to do its job.
Now if your work can't be waited upon and can't be split nicely into little bits (for use in a short-timeout timer), all you have left is another thread and all the synchronization issues that arise from it.
It might be worth it. Or it might bite you in the ass after days and days, potentially weeks, of debugging the odd race condition you missed. It might pay off to spend a long time first to try to split it up into little bits for use with a timer. Even if you can't, the few cases where you can will outweigh the time cost.

You should know that it's hard. Some people think it's impossibly hard, that there's no practical way to verify that a program is thread safe. Dr. Hipp, author of sqlite, states that thread are evil. This article covers the problems with threads in detail.
The Chrome browser uses processes instead of threads, and tools like Stackless Python avoid hardware-supported threads in favor of interpreter-supported "micro-threads". Even things like web servers, where you'd think threading would be a perfect fit, and moving towards event driven architectures.
I myself wouldn't say it's impossible: many people have tried and succeeded. But there's no doubt writting production quality multi-threaded code is really hard. Successful multi-threaded applications tend to use only a few, predetermined threads with just a few carefully analyzed points of communication. For example a game with just two threads, physics and rendering, or a GUI app with a UI thread and background thread, and nothing else. A program that's spawning and joining threads throughout the code base will certainly have many impossible-to-find intermittent bugs.
It's particularly hard in C++, for two reasons:
the current version of the standard doesn't mention threads at all. All threading libraries and platform and implementation specific.
The scope of what's considered an atomic operation is rather narrow compared to a language like Java.
cross-platform libraries like boost Threads mitigate this somewhat. The future C++0x will introduce some threading support. But boost also has good interprocess communication support you could use to avoid threads altogether.
If you know nothing else about threading than that it's hard and should be treated with respect, than you know more than 99% of programmers.
If after all that, you're still interested in starting down the long hard road towards being able to write a multi-threaded C++ program that won't segfault at random, then I recommend starting with Boost threads. They're well documented, high level, and work cross platform. The concepts (mutexes, locks, futures) are the same few key concepts present in all threading libraries.

Why do libraries implement their own basic locks on windows?

Windows provides a number of objects useful for synchronising threads, such as event (with SetEvent and WaitForSingleObject), mutexes and critical sections.
Personally I have always used them, especially critical sections since I'm pretty certain they incur very little overhead unless already locked. However, looking at a number of libraries, such as boost, people then to go to a lot of trouble to implement their own locks using the interlocked methods on Windows.
I can understand why people would write lock-less queues and such, since thats a specialised case, but is there any reason why people choose to implement their own versions of the basic synchronisation objects?

Libraries aren't implementing their own locks. That is pretty much impossible to do without OS support.
What they are doing is simply wrapping the OS-provided locking mechanisms.
Boost does it for a couple of reasons:
They're able to provide a much better designed locking API, taking advantage of C++ features. The Windows API is C only, and not very well-designed C, at that.
They are able to offer a degree of portability. the same Boost API can be used if you run your application on a Linux machine or on Mac. Windows' own API is obviously Windows-specific.
The Windows-provided mechanisms have a glaring disadvantage: They require you to include windows.h, which you may want to avoid for a large number of reasons, not least its extreme macro abuse polluting the global namespace.

One particular reason I can think of is portability. Windows locks are just fine on their own but they are not portable to other platforms. A library which wishes to be portable must implement their own lock to guarantee the same semantics across platforms.

In many libraries (aka Boost) you need to write corss platform code. So, using WaitForSingleObject and SetEvent are no-go. Also, there common idioms, like Monitors, Conditions that Win32 API misses, (but it can be implemented using these basic primitives)
Some lock-free data structures like atomic counter are very useful; for example: boost::shared_ptr uses them in order to make it thread safe without overhead of critical section, most compilers (not msvc) use atomic counters in order to implement thread safe copy-on-write std::string.
Some things like queues, can be implemented very efficiently in thread safe way without locks at all that may give significant perfomance boost in certain applications.

There may occasionally be good reasons for implementing your own locks that don't use the Windows OS synchronization objects. But doing so is a "sharp stick." It's easy to poke yourself in the foot.
Here's an example: If you know that you are running the same number of threads as there are hardware contexts, and if the latency of waking up one of those threads which is waiting for a lock is very important to you, you might choose a spin lock implemented completely in user space. If the waiting thread is the only thread spinning on the lock, the latency of transferring the lock from the thread that owns it to the waiting thread is just the latency of moving the cache line to the owner thread and back to the waiting thread -- orders of magnitude faster than the latency of signaling a thread with an OS lock under the same circumstances.
But the scenarios where you want to do this is pretty narrow. As soon as you start having more software threads than hardware threads, you'll likely regret it. In that scenario, you could spend entire OS scheduling quanta doing nothing but spinning on your spin lock. And, if you care about power, spinlocks are bad because they prevent the processor from going into a low-power state.
I'm not sure I buy the portability argument. Portable libraries often have an OS portability layer that abstracts the different OS APIs for synchronization. If you're dealing with locks, a pthread_mutex can be made semantically the same as a Windows Mutex or Critical Section under an abstraction layer. There's some exceptions here, but for most people this is true. If you're dealing with Windows Events or POSIX condition variables, well, those are tougher to abstract. (Vista did introduce POSIX-style condition variables, but not many Windows software developers are in a position to require Vista...)

Writing locking code for a library is useful if that library is meant to be cross platform. Users of the library can use the library's locking functionality and not have to care about the underlying platform implementation. Assuming the library has versions for all the platforms being targetted it's one less bit of code that has to be ported.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js