Why do libraries implement their own basic locks on Windows? - c++

Windows provides a number of objects useful for synchronising threads, such as events (with SetEvent and WaitForSingleObject), mutexes, and critical sections.
Personally I have always used them, especially critical sections, since I'm pretty certain they incur very little overhead unless already locked. However, looking at a number of libraries, such as Boost, people tend to go to a lot of trouble to implement their own locks using the interlocked methods on Windows.
I can understand why people would write lock-less queues and such, since that's a specialised case, but is there any reason why people choose to implement their own versions of the basic synchronisation objects?

Libraries aren't implementing their own locks. That is pretty much impossible to do without OS support.
What they are doing is simply wrapping the OS-provided locking mechanisms.
Boost does it for a couple of reasons:
They're able to provide a much better designed locking API, taking advantage of C++ features. The Windows API is C only, and not very well-designed C, at that.
They are able to offer a degree of portability: the same Boost API can be used if you run your application on a Linux machine or on a Mac. Windows' own API is obviously Windows-specific.
The Windows-provided mechanisms have a glaring disadvantage: They require you to include windows.h, which you may want to avoid for a large number of reasons, not least its extreme macro abuse polluting the global namespace.
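To illustrate the first point, here is a minimal sketch (not Boost's actual code) contrasting the C-style Win32 pattern with the RAII style that Boost pioneered and the standard later adopted:

// C-style Win32 locking: every exit path must remember to unlock.
//   EnterCriticalSection(&cs);
//   ... if anything here throws or returns early, the lock is leaked ...
//   LeaveCriticalSection(&cs);

// C++ RAII style, as popularised by Boost (shown here with the std::
// equivalents): the guard's destructor unlocks on every exit path,
// including exceptions.
#include <mutex>

std::mutex m;

void f()
{
    std::lock_guard<std::mutex> guard(m);
    // ... critical section: early returns and exceptions are safe ...
}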

One particular reason I can think of is portability. Windows locks are just fine on their own, but they are not portable to other platforms. A library which wishes to be portable must implement its own lock to guarantee the same semantics across platforms.

In many libraries (e.g. Boost) you need to write cross-platform code, so using WaitForSingleObject and SetEvent is a no-go. Also, there are common idioms, like monitors and condition variables, that the Win32 API misses (but they can be implemented using these basic primitives).
Some lock-free data structures, like atomic counters, are very useful; for example, boost::shared_ptr uses them in order to be thread-safe without the overhead of a critical section, and most compilers (not MSVC) use atomic counters in order to implement a thread-safe copy-on-write std::string.
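As a hedged sketch of what such an atomic counter looks like (the class name is invented for illustration; this is not Boost's actual control block):

#include <atomic>

// Minimal atomic reference counter, in the spirit of what a
// shared_ptr control block does.
class ref_count
{
public:
    void add_ref()
    {
        // A relaxed increment is sufficient: no data is published here.
        count_.fetch_add(1, std::memory_order_relaxed);
    }
    bool release()
    {
        // Acquire/release ordering so the last owner observes all
        // writes made by the other owners before destroying the object.
        return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
private:
    std::atomic<long> count_{1};
};

When release() returns true, the caller is the last owner and may delete the managed object; no critical section was ever taken.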
Some things, like queues, can be implemented very efficiently in a thread-safe way without locks at all, which may give a significant performance boost in certain applications.

There may occasionally be good reasons for implementing your own locks that don't use the Windows OS synchronization objects. But doing so is a "sharp stick." It's easy to poke yourself in the foot.
Here's an example: If you know that you are running the same number of threads as there are hardware contexts, and if the latency of waking up one of those threads which is waiting for a lock is very important to you, you might choose a spin lock implemented completely in user space. If the waiting thread is the only thread spinning on the lock, the latency of transferring the lock from the thread that owns it to the waiting thread is just the latency of moving the cache line to the owner thread and back to the waiting thread -- orders of magnitude faster than the latency of signaling a thread with an OS lock under the same circumstances.
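For reference, a pure user-space spinlock of the kind described above can be sketched in a few lines of standard C++ (a sketch, not a recommendation):

#include <atomic>

// No kernel involvement at all: lock handoff costs only the
// cache-line transfer described above.
class spinlock
{
public:
    void lock()
    {
        while (flag_.test_and_set(std::memory_order_acquire))
        {
            // busy-wait; a production version would issue a pause
            // instruction here to be kinder to the core
        }
    }
    void unlock()
    {
        flag_.clear(std::memory_order_release);
    }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};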
But the scenarios where you want to do this are pretty narrow. As soon as you start having more software threads than hardware threads, you'll likely regret it. In that scenario, you could spend entire OS scheduling quanta doing nothing but spinning on your spin lock. And, if you care about power, spinlocks are bad because they prevent the processor from going into a low-power state.
I'm not sure I buy the portability argument. Portable libraries often have an OS portability layer that abstracts the different OS APIs for synchronization. If you're dealing with locks, a pthread_mutex can be made semantically the same as a Windows Mutex or Critical Section under an abstraction layer. There are some exceptions here, but for most people this is true. If you're dealing with Windows Events or POSIX condition variables, those are tougher to abstract. (Vista did introduce POSIX-style condition variables, but not many Windows software developers are in a position to require Vista...)
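The kind of portability layer described above can be sketched like this (the wrapper name is invented; only the underlying OS calls are real):

// A tiny OS portability layer over the native lock.
#ifdef _WIN32
  #include <windows.h>
  class os_lock
  {
  public:
      os_lock()  { InitializeCriticalSection(&cs_); }
      ~os_lock() { DeleteCriticalSection(&cs_); }
      void lock()   { EnterCriticalSection(&cs_); }
      void unlock() { LeaveCriticalSection(&cs_); }
  private:
      CRITICAL_SECTION cs_;
  };
#else
  #include <pthread.h>
  class os_lock
  {
  public:
      os_lock()  { pthread_mutex_init(&m_, nullptr); }
      ~os_lock() { pthread_mutex_destroy(&m_); }
      void lock()   { pthread_mutex_lock(&m_); }
      void unlock() { pthread_mutex_unlock(&m_); }
  private:
      pthread_mutex_t m_;
  };
#endif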

Writing locking code for a library is useful if that library is meant to be cross-platform. Users of the library can use the library's locking functionality and not have to care about the underlying platform implementation. Assuming the library has versions for all the platforms being targeted, it's one less bit of code that has to be ported.

Related

How to create a single process mutex within C++?

So I'm reading about monitors vs mutexes and finding mentions that suggest that monitors are faster than mutexes because they don't lock system-wide but rather only across the threads of a given process.
Is there some way in C++ to accomplish or simulate this?
Edit: I'm curious now what the difference is between a system-wide mutex and one restricted to a specific process.
The C++ standard does not define system-wide vs per-process primitives, so C++ does not specify whether std::mutex is system-wide.
Reasonable implementations have an efficient per-process std::mutex; to get a system-wide mutex you'll need to use libraries or operating system objects for your platform.
The difference is that a per-process mutex may use any memory operations to avoid system calls, as the process memory is shared among the process's threads. Atomic operations on that memory are more efficient, and a system call is often avoided by using them. A system-wide mutex will either start with system calls (not efficient), or will have to use shared memory (might be unsafe, and may still have some overhead).
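To make the distinction concrete, here is a hedged sketch (the mutex name "Global\\MyApp" is invented for illustration; error handling is elided):

#include <mutex>
#include <windows.h>

// Per-process: an ordinary std::mutex, typically an atomic fast path.
std::mutex in_process_lock;

void system_wide_example()
{
    // System-wide: a named kernel mutex, visible to every process
    // that opens the same name. Acquire and release are system calls.
    HANDLE h = CreateMutexW(nullptr, FALSE, L"Global\\MyApp");
    if (h == nullptr) return;
    WaitForSingleObject(h, INFINITE);   // acquire
    // ... section shared across processes ...
    ReleaseMutex(h);                    // release
    CloseHandle(h);
}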
The answer by @Alex Guteniev is as accurate as one can get (and should be considered the accepted answer). It states that the C++ standard doesn't define a system-wide concept, and that mutexes for all practical purposes are per-process, i.e. for synchronization between threads (execution agents) in a single process (and therefore according to your needs). The C++ standard makes it clear what a thread (std::thread) is (33.3 - ... intended to map one-to-one with OS threads (in my draft, at least... N4687)).
Microsoft, post VC2015, has improved their implementation to use Windows primitives, as stated here. This is also indicated here in the most upvoted answer. I've also looked at the boost library implementations (which often precede/influence the C++ standard) for Microsoft, and (AFAICT) they don't use any inter-process calls.
So to answer your question: in C++, mutexes and monitors are practically the same thing, if this definition is to be considered accurate.
Update: I stumbled across the answer to this while researching something related.
On Windows, critical sections can be used within a single process instead of system-wide mutexes, and are often faster:
Edit:
While the above statement is correct, C++ doesn't have the concept of a system-wide mutex. This concept only exists when using OS-specific primitives such as Win32 CreateMutex, and is not relevant to standard C++.
Source:
std::mutex performance compared to win32 CRITICAL_SECTION
On Linux, pthread mutexes are per-process by default (they can be shared across processes only if placed in shared memory and marked PTHREAD_PROCESS_SHARED).

Does Boost have support for Windows EnterCriticalSection API?

I know Boost has support for mutexes and lock_guard, which can be used to implement critical sections.
But Windows has a special API for critical sections (see EnterCriticalSection and LeaveCriticalSection) which is a LOT faster than a mutex (for rarely contended, short sections of code).
Hence my question - is it possible in Boost to take advantage of this API, and fall back to a spinlock/mutex/futex-based implementation on other platforms?
The simple answer is no.
Here's some relevant background from an old mailing list thread:
BTW, I agree that a mutex is a more universal solution from a performance point of view. But to be fair - CS are faster in a simple design. I believe that the possibility to support them should at least be taken into account.
This was the article that someone pointed me to. The conclusion was that CS are only faster if:
There are fewer than 8 threads total in the process.
You weren't running in the background.
You weren't on a dual-processor machine.
To me this means that simple testing yields good CS performance results, but any real-world program is better off with a full-blown mutex.
I'm not averse to supporting a CS implementation. However, I originally chose not to for the following reasons:
You get either construction and destruction hits from using a PIMPL idiom, or you must include Windows.h in the Boost.Threads headers, which I simply don't want to do. (This can be worked around by emulating a CS a la OPTEX from the MSDN.)
According to this research paper, most programs won't benefit from a CS design.
It's trivial to code a (non-portable) critical_section class that follows the Mutex model if you truly can make use of this.
For now I think I've made the right choice, though down the road we may change the implementation to use a critical section or OPTEX.
Bill Kempf
Speaking as someone who helps out maintaining Boost.Thread, and as someone who failed to get an event object into Boost.Thread, I don't think critical sections have ever been added nor would be added to Boost for these reasons:
A Win32 critical section is trivially easy to build using a boost::atomic and a boost::condition_variable, so much so that it isn't really worth having an official one. Here is probably the most complex one you could imagine, but it is extremely configurable, including being constexpr-ready (don't ask!): https://github.com/ned14/boost.outcome/blob/master/include/boost/outcome/v1/spinlock.hpp#L331
You can build your own simply by matching the (Basic)Lockable concept and using atomic compare_exchange (non-x86/x64) or atomic exchange (x86/x64), and then grab it using a lock_guard around the critical section.
Some may object that a win32 critical section is not this. I am afraid it is: it simply spins on an atomic for a spin count, and then lazily tries to allocate a win32 event object which it then waits upon. Nothing special (a sketch of this spin-then-wait pattern appears after this answer).
As much as you might think critical sections (really user mode mutexes) are better/faster/whatever, they probably are not as great as you might think. boost::mutex is a big vast heavyweight thing on Windows internally using a win32 semaphore as the kernel wait object because of the need to emulate thread cancellation and to behave well in a general purpose use context. It's easy to write a concurrency structure which is faster than another for some single use case, but it is very very hard to write a concurrency structure which is all of:
Faster than a standard implementation in the uncontended case.
Faster than a standard implementation in the lightly contended case.
Faster than a standard implementation in the heavily contended case.
Even if you manage all three of the above, that still isn't enough: you also need some guarantees on worst-case progress ordering, i.e. whether certain patterns of locks, waits and unlocks produce predictable outcomes. This is why threading facilities can appear to be slow in narrow use-case scenarios, and why Boost.Thread, much like the STL, can appear to be much slower than hand-rolled locking code in, say, an uncontended use case.
Boost.Thread already does substantial work in user mode to avoid going to a kernel sleep on Windows. On POSIX, any of the major pthreads implementations also does substantial work to avoid kernel sleeps, and hence Boost.Thread doesn't replicate that work. In other words, critical sections don't gain you anything in terms of scaling-to-load behaviour, though inevitably Boost.Thread v4, especially on Windows, does a ton of work that a naive implementation does not (the planned rewrite of Boost.Thread is vastly more efficient on Windows, as it can assume Windows Vista or above).
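As a rough illustration of the spin-then-wait pattern mentioned above, here is a hedged sketch in standard C++; a std::condition_variable stands in for the lazily created win32 event, and a real CRITICAL_SECTION is considerably more refined than this:

#include <atomic>
#include <condition_variable>
#include <mutex>

class spin_then_wait_lock
{
public:
    void lock()
    {
        // Fast path: spin on the atomic for a bounded count.
        for (int i = 0; i < spin_count; ++i)
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
        // Slow path: block until an unlock wakes us and we win the race.
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] {
            return !locked_.exchange(true, std::memory_order_acquire);
        });
    }
    void unlock()
    {
        {
            // Holding the internal mutex here avoids a lost wakeup;
            // real implementations track waiter counts to skip this.
            std::lock_guard<std::mutex> lk(m_);
            locked_.store(false, std::memory_order_release);
        }
        cv_.notify_one();
    }
private:
    static const int spin_count = 4000;  // akin to a CS spin count
    std::atomic<bool> locked_{false};
    std::mutex m_;
    std::condition_variable cv_;
};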
So, it looks like the default Boost mutex doesn't support it, but asio::detail::mutex does.
So I ended up using that:
#include <boost/asio/detail/mutex.hpp>
#include <boost/thread.hpp>
using boost::asio::detail::mutex;
using boost::lock_guard;
int myFunc()
{
    static mutex mtx;
    lock_guard<mutex> lock(mtx);
    . . .
}

How to ensure that std::threads are created across multiple cores?

I am using Visual Studio 2012. I have a module where I have to read a huge set of files from the hard disk after traversing their corresponding paths through an XML file. For this I am doing
std::vector<std::thread> m_ThreadList;
In a while loop I am pushing back a new thread into this vector, something like
m_ThreadList.push_back(std::thread(&MyClass::Readfile, &MyClassObject, filepath, std::ref(polygon)));
My C++11 multithreading knowledge is limited. The question that I have here is: how do I create a thread on a specific core? I know of parallel_for and parallel_for_each in VS2012, which make optimum use of the cores. But is there a way to do this using standard C++11?
As pointed out in other comments, you cannot create a thread "on a specific core", as C++ has no knowledge of such architectural details. Moreover, in the majority of cases, the operating system will be able to manage the distribution of threads among cores/processors well enough.
That said, there exist cases in which forcing a specific distribution of threads among cores can be beneficial for performance. As an example, by forcing a thread to execute on one specific core it might be possible to minimise data movement between different processor caches (which can be critical for performance in certain memory-bound scenarios).
If you want to go down this road, you will have to look into platform-specific routines. E.g., for GNU/Linux with POSIX threads you will want pthread_setaffinity_np(), on FreeBSD cpuset_setaffinity(), on Windows SetThreadAffinityMask(), etc.
I have some relevant code snippets here if you are interested:
http://gitorious.org/piranhapp0x/mainline/blobs/master/src/thread_management.cpp
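For example, on Windows you might pin a std::thread through its native handle; a hedged sketch, assuming the implementation's native_handle() is a Win32 HANDLE (as with MSVC), with error handling elided:

#include <thread>
#include <windows.h>

int main()
{
    std::thread t([] { /* do the work */ });
    // Bit 0 set => this thread may only run on logical processor 0.
    SetThreadAffinityMask(t.native_handle(), DWORD_PTR(1));
    t.join();
}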
I'm fairly certain that core affinity isn't included in std::thread. The assumption is that the OS is perfectly capable of making the best possible use of the cores available. In all but the most extreme of cases you're not going to beat the OS's decision, so the assumption is a fair one.
If you do go down that route, then you have to add some decision-making to your code to take account of machine architecture, to ensure that your decision is better than the OS's on every machine you run on. That takes a lot of effort! For starters, you'll want to limit the number of threads to match the number of cores on the computer. And you don't have any knowledge of what else is going on in the machine; the OS does!
Which is why thread pools exist. They tend by default to have as many threads as there are cores, automatically set up by the language runtime. AFAIK C++11 doesn't have one of those. So the one good thing you can do to get optimum performance is to find out how many cores there are and limit the number of threads you have to that number (see the sketch below). Otherwise it's probably just best to trust the OS.
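A minimal sketch of that advice using only standard C++11:

#include <thread>
#include <vector>

int main()
{
    // Limit the worker count to the number of hardware threads.
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 2;  // the call may return 0 if unknown

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([] { /* read one file, etc. */ });
    for (auto& t : workers) t.join();
}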
Joachim Pileborg's comment is well worth paying attention to, unless the work done by each thread outweighs the I/O overhead.
As a quick overview of threading in the context of dispatching threads to cores:
Most modern OSes make use of kernel-level threads, or a hybrid. With kernel-level threading, the OS "sees" all the threads in each process; in contrast to user-level threads, as historically employed in Java, where the OS sees a single process and has no knowledge of threading. Because, with kernel-level threading, the OS can recognise the separate threads of a process and manages their dispatch onto a given core, there is the potential for true parallelism - where multiple threads of the same process are run on different cores. You, as the programmer, will have no control over this, however, when employing std::thread; the OS decides. With user-level threading, all the management of threads is done at the user level; in Java's case, a library manages the "dispatch". In the case of hybrid threading, kernel threading is used, where each kernel thread is actually a set of user-level threads.

How to make a threading mechanism in C++?

I know there are some threading libraries for C++, like pthreads, Boost, etc. out there, but how do they work? There must be an implementation of the logic somewhere.
Let's say that I would like to write my own threading mechanism in C++, not using any library, how would I start? What should I have in mind when writing it?
You'd directly call the underlying API calls in the operating system. For example, CreateThread. Naturally, this is cumbersome and platform-specific, which is why we like to use portable C++ threading libraries...
In C++98/03, there is no notion of a "thread", so the question cannot be answered within the language. In C++11, the answer is to use <thread>.
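A minimal sketch of the C++11 route:

#include <iostream>
#include <thread>

int main()
{
    // The standard library delegates thread creation to the OS;
    // there is no user-level scheduler involved.
    std::thread t([] { std::cout << "hello from a thread\n"; });
    t.join();
}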
On the implementation side, threading is an operating system feature. The operating system already has to schedule multiple processes (i.e. separate programs), and a multi-threading OS adds to that the ability to schedule multiple threads within one process. At the very heart, the OS may or may not take advantage of having physically more than one CPU (though that also applies to simple multi-processing; and conversely, you can schedule multiple threads on a single CPU). At the heart of the programming, you will need hardware support for synchronisation primitives like atomic reads/writes and atomic compare-and-swap to implement correct memory access. (This is not needed for multi-processing alone, because separate processes have distinct memory; although it will be needed by the OS itself if there are multiple physical CPUs in use.)
Well, you need something which is able to run several threads.
If you are working on developing an operating system kernel on the bare metal: I think that current multi-core processors have only one core working after their power-on reset, and even the BIOS on most PCs probably keeps only one core working (and the other cores idle). So you'll need to write (assembly, non-portable) code to start the other cores.
And (as James reminded you), most of the time you are using some operating system kernel. For instance, on Linux (I don't know about Windows), threads are known to the kernel (because the tasks it schedules are threads) and they need to be initiated by the Linux clone(2) system call.
Often, kernel threads are quite heavy, and the system has a library (NPTL for Linux POSIX threads) which may use fewer kernel threads than user threads (actually Linux NPTL is a 1:1 mapping between kernel and user threads, but on some other systems, like probably Solaris, things are different).
You can't write your own threading mechanism, unless you mean pseudo-threads like co-routines and not actual concurrently executing threads. This is because the fundamental thread mechanism is defined by the kernel and you can't change it nor implement your own. Any library you write must fall back, eventually, to the operating system.

Multithreading vs multiprocessing

I am new to this kind of programming and need your point of view.
I have to build an application but I can't get it to compute fast enough. I have already tried Intel TBB, and it is easy to use, but I have never used other libraries.
For multiprocessor programming, I am reading about OpenMP, and Boost for multithreading, but I don't know their pros and cons.
In C++, when is multithreaded programming advantageous compared to multiprocessor programming, and vice versa? Which is best suited to heavy computations or launching many tasks? What are their pros and cons when we build an application designed with them? And finally, which library is best to work with?
Multithreading means exactly that, running multiple threads. This can be done on a uni-processor system, or on a multi-processor system.
On a single-processor system, when running multiple threads, the actual observation of the computer doing multiple things at the same time (i.e., multi-tasking) is an illusion, because what's really happening under the hood is that there is a software scheduler performing time-slicing on the single CPU. So only a single task is happening at any given time, but the scheduler is switching between tasks fast enough so that you never notice that there are multiple processes, threads, etc., contending for the same CPU resource.
On a multi-processor system, the need for time-slicing is reduced. The time-slicing effect is still there, because a modern OS could have hundreds of threads contending for two or more processors, and there is typically never a 1-to-1 relationship between the number of threads and the number of processing cores available. So at some point, a thread will have to stop and another thread starts on a CPU that the two threads are sharing. This is again handled by the OS's scheduler. That being said, with a multiprocessor system, you can have two things happening at the same time, unlike with the uni-processor system.
In the end, the two paradigms are really somewhat orthogonal in the sense that you will need multithreading whenever you want to have two or more tasks running asynchronously, but because of time-slicing, you do not necessarily need a multi-processor system to accomplish that. If you are trying to run multiple threads, and are doing a task that is highly parallel (i.e., trying to solve an integral), then yes, the more cores you can throw at the problem, the better. You won't necessarily need a 1-to-1 relationship between threads and processing cores, but at the same time, you don't want to spin off so many threads that you end up with tons of idle threads, because they must wait to be scheduled on one of the available CPU cores. If, on the other hand, your parallel task requires some sequential component, i.e., a thread will be waiting for the result from another thread before it can continue, then you may be able to run more threads with some type of barrier or synchronisation method, so that the threads that need to be idle are not spinning away using CPU time, and only the threads that need to run are contending for CPU resources.
There are a few important points that I believe should be added to the excellent answer by @Jason.
First, multithreading is not always an illusion even on a single processor - there are operations that do not involve the processor. These are mainly I/O: disk, network, terminal, etc. The basic form of such an operation is blocking, or synchronous: your program waits until the operation is completed and then proceeds. While waiting, the CPU is switched to another process/thread.
If you have anything you can do during that time (e.g. background computation while waiting for user input, serving another request, etc.), you have basically two options:
Use asynchronous I/O: you call a non-blocking I/O routine, providing it with a callback function, telling it "call this function when you are done". The call returns immediately and the I/O operation continues in the background. You go on with the other stuff.
Use multithreading: you have a dedicated thread for each kind of task. While one waits for the blocking I/O call, the other goes on.
Both approaches are difficult programming paradigms, each has its pros and cons.
With async I/O, the logic of the program is less obvious and more difficult to follow and debug. However, you avoid thread-safety issues.
With threads, the challenge is to write thread-safe programs. Thread-safety faults are nasty bugs that are quite difficult to reproduce. Over-use of locking can actually degrade performance instead of improving it.
(Coming to multiprocessing:)
Multithreading became popular on Windows because manipulating processes is quite heavy on Windows (creating a process, context-switching, etc.) as opposed to threads, which are much more lightweight (at least this was the case when I worked on Win2K).
On Linux/Unix, processes are much more lightweight. Also (AFAIK) threads on Linux are implemented internally as a kind of process, so there is no gain in context-switching of threads vs. processes. However, you need to use some form of IPC (inter-process communication), such as shared memory, pipes, message queues, etc.
On a lighter note, look at the SQLite FAQ, which declares "Threads are evil"! :)
To answer the first question:
The best approach is to just use multithreading techniques in your code until you get to the point where even that doesn't give you enough benefit. Assume the OS will handle delegation to multiple processors if they're available.
If you actually are working on a problem where multithreading isn't enough, even with multiple processors (or if you're running on an OS that isn't using its multiple processors), then you can worry about discovering how to get more power. Which might mean spawning processes across a network to other machines.
I haven't used TBB, but I have used IPP and found it to be efficient and well-designed. Boost is portable.
Just wanted to mention that the Flow-Based Programming (http://www.jpaulmorrison.com/fbp) paradigm is a naturally multiprogramming/multiprocessing approach to application development. It provides a consistent application view from high level to low level. The Java and C# implementations take advantage of all the processors on your machine, but the older C++ implementation only uses one processor. However, it could fairly easily be extended to use Boost (or pthreads, I assume) by the use of locking on connections. I had started converting it to use fibers, but I'm not sure if there's any point in continuing on this route. :-) Feedback would be appreciated. BTW the Java and C# implementations can even intercommunicate using sockets.