Can every imaginable synchronization problem be solved with judicious use of semaphores? What about weak semaphores?
No. Just for example, it's impossible for a system that uses only semaphores for synchronization to provide wait-free guarantees, or even progress guarantees, in the face of third-party code (e.g. a plugin). A perverse or poorly-written section of code can deny access to a semaphore-guarded section of code to everyone forever.
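To see why, consider a minimal sketch using C++20's std::binary_semaphore (the function names here are hypothetical): a plugin that acquires the semaphore and never releases it starves every other thread, and nothing in the semaphore itself can restore progress.

#include <chrono>
#include <semaphore>
#include <thread>

std::binary_semaphore gate{1}; // guards some shared resource

void misbehaving_plugin() // hypothetical third-party code
{
    gate.acquire();
    for (;;) // never releases the semaphore
        std::this_thread::sleep_for(std::chrono::seconds(1));
}

void well_behaved_worker()
{
    gate.acquire(); // blocks forever once the plugin has run
    // ... critical section (never reached) ...
    gate.release();
}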
Agerwala argues that appropriately extended semaphores are complete. This doesn't answer all my questions, but it is on the right track. David Seiler has a point too.
Before explaining the question in more detail, I'll note that the answer is obviously implementation-dependent, so I'm primarily asking about libstdc++, but I would also be interested in hearing about libc++. The operating system is Linux.
Calling wait() or get() on a std::future blocks until the result is set by an asynchronous operation: a std::promise, a std::packaged_task, or std::async. The availability of the result is communicated through a shared state, which is essentially an atomic variable: the future waits for the shared state to be marked as ready by the promise (or async task). In libstdc++ this waiting and notification is implemented via a futex system call. Granted that futexes are highly performant, in situations where a future expects to wait only for an extremely brief period (on the order of single microseconds), it would seem that a performance gain could be made by spinning on the shared state for a short time before proceeding to wait on the futex.
I have not found any evidence of such spinning in the current implementation; however, I did find a comment in atomic_futex.h at line 161, exactly where I would expect to find such spinning:
// TODO Spin-wait first.
So my question is rather the following: Are there really plans to implement a spin-wait, and if so, how will the duration be decided? Additionally, is this the type of functionality that could eventually be specified through a policy to the future?
I will answer the question: Does std::future::get() perform a spin-wait?
The answer for all of C++ is: it is an implementation detail. A conforming standard library might spin or it might not (in the same vein, std::mutex::lock() is allowed to spin). Will there be a mechanism for specifying if and how to spin in the future? The places to look are std::experimental::future (coming to a full version of the Standard Library soon), boost::future (a proving ground for what might later go into the standard), and hpx::future (a performance-focused library with advanced facilities for future management). None of these has a mechanism for explicitly specifying spinning, nor, as far as I know, has there been any discussion in meeting minutes or on the ISO C++ mailing list. It is safe to say that something like a get_with_spins function is not in the pipeline.
To answer for libstdc++ (and libc++): they do not spin either. Aside from the TODO, which came from the original patch, there does not appear to be any plan to change this. I've searched the GCC mailing list for mentions of changing this behavior but found none. Doing a pre-sleep spin can hurt in the general case (if none of the get()s has a value ready, you've wasted a lot of CPU cycles), so a change here could have negative impacts.
To summarize: Implementations do not appear to spin now and there appear to be no plans to change the behavior in the near future, but that can change at any moment.
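For what it's worth, the spin-then-block wait hinted at by the TODO is easy to sketch by hand. Below is a minimal illustration for Linux, not libstdc++'s actual plan: the spin budget of 1000 iterations and the raw futex plumbing are my own assumptions.

#include <atomic>
#include <climits>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// One shared word; 0 = result not ready, 1 = ready.
std::atomic<int> ready{0};

static void futex_wait(std::atomic<int>* addr, int expected)
{
    // The kernel re-checks the value, so a wake between our load and the
    // syscall is not lost.
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, nullptr, nullptr, 0);
}

static void futex_wake_all(std::atomic<int>* addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT_MAX, nullptr, nullptr, 0);
}

void wait_for_result() // what a spinning future wait might do
{
    for (int i = 0; i < 1000; ++i) // spin budget: an arbitrary assumption
        if (ready.load(std::memory_order_acquire))
            return;
    while (!ready.load(std::memory_order_acquire)) // fall back to a kernel sleep
        futex_wait(&ready, 0);
}

void set_result() // what the promise side might do
{
    ready.store(1, std::memory_order_release);
    futex_wake_all(&ready);
}

The maintainers' concern is visible right in the sketch: when the result never arrives quickly, the entire spin budget is wasted CPU time.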
I know Boost has support for mutexes and lock_guard, which can be used to implement critical sections.
But Windows has a special API for critical sections (see EnterCriticalSection and LeaveCriticalSection) which is a LOT faster than a mutex (for rarely contended, short sections of code).
Hence my question: is it possible in Boost to take advantage of this API, and fall back to a spinlock/mutex/futex-based implementation on other platforms?
The simple answer is no.
Here's some relevant background from an old mailing list thread:
BTW, I agree that a mutex is the more universal solution from a performance point of view. But to be fair, CSes are faster in a simple design. I believe the possibility of supporting them should at least be taken into account.
This was the article that someone pointed me to. The conclusion was that CSes are only faster if:
There are fewer than 8 threads total in the process.
You weren't running in the background.
You weren't on a dual-processor machine.
To me this means that simple testing yields good CS performance results, but any real-world program is better off with a full-blown mutex.
I'm not averse to supporting a CS implementation. However, I originally chose not to for the following reasons:
You either get construction and destruction hits from using a PIMPL idiom, or you must include Windows.h in the Boost.Threads headers, which I simply don't want to do. (This can be worked around by emulating a CS à la OPTEX from the MSDN.)
According to this research paper, most programs won't benefit from a CS design.
It's trivial to code a (non-portable) critical_section class that follows the Mutex model if you truly can make use of this.
For now I think I've made the right choice, though down the road we may change the implementation to use a critical section or OPTEX.
Bill Kempf
Speaking as someone who helps maintain Boost.Thread, and as someone who failed to get an event object into Boost.Thread, I don't think critical sections have ever been added to Boost, nor would they be, for these reasons:
A Win32 critical section is trivially easy to build using a boost::atomic and a boost::condition_variable, so much so that it isn't really worth having an official one. Here is probably the most complex one you could imagine, but it is extremely configurable, including being constexpr-ready (don't ask!): https://github.com/ned14/boost.outcome/blob/master/include/boost/outcome/v1/spinlock.hpp#L331
You can build your own simply by modeling the (Basic)Lockable concept, using atomic compare_exchange (non-x86/x64) or atomic exchange (x86/x64), and then grabbing it with a lock_guard around the critical section (see the sketch below).
Some may object that a Win32 critical section is not this. I am afraid it is: it simply spins on an atomic for a spin count, and then lazily allocates a Win32 event object which it then waits upon. Nothing special.
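To make that concrete, here is a minimal sketch of such a lock, a BasicLockable spinlock built on std::atomic. The spin budget is an arbitrary assumption, and unlike a real Win32 critical section it merely yields instead of falling back to a kernel event object.

#include <atomic>
#include <mutex>
#include <thread>

class spin_mutex // models BasicLockable: lock() and unlock()
{
    std::atomic<bool> locked_{false};
public:
    void lock()
    {
        int spins = 0;
        while (locked_.exchange(true, std::memory_order_acquire))
            if (++spins > 4000)            // assumed spin count
                std::this_thread::yield(); // a real CS would block on an event here
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

void critical_work()
{
    static spin_mutex mtx;
    std::lock_guard<spin_mutex> lock(mtx); // works because spin_mutex is BasicLockable
    // ... short critical section ...
}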
As much as you might think critical sections (really user-mode mutexes) are better/faster/whatever, they probably are not as great as you might think. boost::mutex is a big, heavyweight thing on Windows, internally using a Win32 semaphore as the kernel wait object because of the need to emulate thread cancellation and to behave well in a general-purpose context. It's easy to write a concurrency structure which is faster than another for some single use case, but it is very hard to write a concurrency structure which is all of:
Faster than a standard implementation in the uncontended case.
Faster than a standard implementation in the lightly contended case.
Faster than a standard implementation in the heavily contended case.
Even if you manage all three of the above, that still isn't enough: you also need some guarantees on worst-case progression ordering, i.e. whether certain patterns of locks, waits, and unlocks produce predictable outcomes. This is why threading facilities can appear slow in narrow use cases, and why Boost.Thread, much like the STL, can appear much slower than hand-rolled locking code in, say, an uncontended use case.
Boost.Thread already does substantial work in user mode to avoid going to a kernel sleep on Windows. On POSIX, all of the major pthreads implementations also do substantial work to avoid kernel sleeps, so Boost.Thread doesn't replicate that work. In other words, critical sections don't gain you anything in terms of scaling to load, though inevitably Boost.Thread v4, especially on Windows, does a great deal of work that a naive implementation does not (the planned rewrite of Boost.Thread is vastly more efficient on Windows, as it can assume Windows Vista or above).
So, it looks like the default Boost mutex doesn't support it, but asio::detail::mutex does.
So I ended up using that:
#include <boost/asio/detail/mutex.hpp>
#include <boost/thread.hpp>

using boost::asio::detail::mutex;
using boost::lock_guard;

int myFunc()
{
    static mutex mtx;
    lock_guard<mutex> lock(mtx);
    // ...
}
I am currently building a thin C++ wrapper around pthreads for in-house use. Windows as well as QNX are targeted, and fortunately the pthreads-win32 port seems to work very well, while QNX is conformant to POSIX for our practical purposes.
Now, while implementing semaphores, I hit the function
sem_post_multiple(sem_t*, int)
which is apparently only available on pthreads-win32 but is missing from QNX. As the name suggests, the function is supposed to increment the semaphore by the count given as the second argument. As far as I can tell, the function is part of neither POSIX.1b nor POSIX.1c.
Although there is currently no requirement for said function, I am still wondering why pthreads-win32 provides it and whether it might be of use. I could try to mimic it for QNX with something similar to the following:
int sem_post_multiple_qnx(sem_t* sem, int count)
{
    for (; count > 0; --count)
    {
        if (sem_post(sem) == -1)
            return -1; // propagate the first failure, as sem_post itself would
    }
    return 0;
}
What I am asking for are suggestions/advice on how to proceed. If the consensus is to implement the function for QNX, I would also appreciate comments on whether the suggested code snippet is a viable solution.
Thanks in advance.
PS: I deliberately left out my fancy C++ class for clarity. For all the folks suggesting Boost to the rescue: it is not an option in my current project, for management reasons.
In any case, semaphores are an optional extension in POSIX. E.g. OS X doesn't seem to implement them fully. So if you are concerned with portability, you'd have to provide wrappers for the functionality you need anyhow.
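As a hedged illustration, the optional extension can at least be detected at compile time through the POSIX feature-test macro, leaving the fallback to your wrapper:

#include <unistd.h>

#if defined(_POSIX_SEMAPHORES) && _POSIX_SEMAPHORES > 0
#include <semaphore.h> // sem_t is available; wrap it directly
#else
// fall back to an emulation, e.g. a mutex plus a condition variable
#endif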
Your approach of emulating an atomic increment by iterated sem_post certainly has downsides:
It may perform badly, yet sem_t is usually used in performance-critical contexts.
The operation would not be atomic, so confusing things might happen before you finish the loop.
I would stick to just what is necessary and strictly POSIX-conforming. Beware that sem_timedwait is yet another optional part of the semaphore option.
Your proposed implementation of sem_post_multiple doesn't play nicely with sem_getvalue, since sem_post_multiple is an atomic increase and therefore it's not possible for a "simultaneous" call to sem_getvalue to return any of the intermediate values.
Personally I'd want to leave them both out: trying to add fundamental synchronization operations to a system which lacks them is a mug's game, and your wrapper might soon cease to be "thin". So don't get into it unless you have code that uses sem_post_multiple, that you absolutely have to port.
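To make the atomicity point concrete: if the wrapper owned its own counter instead of wrapping sem_t, a multi-post could be made atomic with respect to value queries. A minimal sketch, with all names hypothetical:

#include <condition_variable>
#include <mutex>

class counting_semaphore_emulation
{
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
public:
    void wait()
    {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ > 0; });
        --count_;
    }
    void post_multiple(int n)
    {
        {
            std::lock_guard<std::mutex> lk(m_);
            count_ += n; // one atomic increase; no intermediate values observable
        }
        cv_.notify_all();
    }
    int get_value()
    {
        std::lock_guard<std::mutex> lk(m_);
        return count_;
    }
};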
sem_post_multiple() is a non-standard helper function introduced by the win32-pthreads maintainers. Your implementation isn't the same as theirs because the multiple decrements aren't atomic. Whether or not this is a problem depends on the intended use. (Personally, I wouldn't try to implement this function unless/until the need arises.)
This is an interesting question. +1.
I agree with the prevailing consensus here that it is probably not a good idea to implement that function. While your proposed implementation would probably work just fine in most situations, there are definitely conditions under which the results could be dramatically different due to the non-atomicity. The following is one (extremely) contrived situation:
Start thread A which calls sem_post_multiple( s, 10 )
Thread B waiting on s is released. Thread B kills thread A.
In the above unfriendly scenario, the atomic version would have incremented the semaphore by 10. With the non-atomic version, it may only be incremented once. This example is certainly not likely in the real world. For example, killing a thread is almost always a bad idea, not to mention the fact that it could leave the semaphore object in an invalid state. The Win32 implementation could leave a mutex lock on the semaphore - see this for why.
This is an interview question. How do you implement a read/write mutex? There will be multiple threads reading and writing to a resource. I'm not sure how to go about it. If there's any information needed, please let me know.
Update: I'm not sure if my statement above is valid/understandable, but what I really want to know is how you implement multiple reads and multiple writes on a single object, in terms of mutexes and whatever other synchronization objects are needed.
Check out Dekker's algorithm.
Dekker's algorithm is the first known correct solution to the mutual exclusion problem in concurrent programming. The solution is attributed to Dutch mathematician Th. J. Dekker by Edsger W. Dijkstra in his manuscript on cooperating sequential processes. It allows two threads to share a single-use resource without conflict, using only shared memory for communication.
Note that Dekker's algorithm uses a spinlock (not a busy-waiting) technique.
(Th. J. Dekker's solution, mentioned by E. W. Dijkstra in his EWD1303 paper)
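For reference, here is a minimal sketch of the algorithm in C++, using sequentially consistent std::atomic so that the classic correctness argument carries over; the two threads are identified as 0 and 1.

#include <atomic>

std::atomic<bool> wants_to_enter[2] = {{false}, {false}};
std::atomic<int> turn{0};

void lock(int self)
{
    int other = 1 - self;
    wants_to_enter[self] = true;
    while (wants_to_enter[other]) {
        if (turn != self) {
            wants_to_enter[self] = false;       // back off: it is not our turn
            while (turn != self) { /* spin */ }
            wants_to_enter[self] = true;
        }
    }
}

void unlock(int self)
{
    turn = 1 - self; // hand the turn to the other thread
    wants_to_enter[self] = false;
}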
The short answer is that it is surprisingly difficult to roll your own read/write lock. It's very easy to miss a very subtle timing problem that could result in deadlock, two threads both thinking they have an "exclusive" lock, etc.
In a nutshell, you need to keep a count of how many readers are active at any particular time. Only when the number of active readers is zero should you grant a thread write access. There are design choices as to whether readers or writers are given priority. (Often, you want to give writers priority, on the assumption that writing is done less frequently.) The (surprisingly) tricky part is to ensure that no writer is given access while there are readers, or vice versa.
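Here is a minimal sketch of that counting design with writer priority, built on std::mutex and std::condition_variable; the class and member names are my own.

#include <condition_variable>
#include <mutex>

class rw_mutex
{
    std::mutex m_;
    std::condition_variable readers_cv_, writers_cv_;
    int active_readers_ = 0;
    int waiting_writers_ = 0;
    bool writer_active_ = false;
public:
    void lock_shared()
    {
        std::unique_lock<std::mutex> lk(m_);
        // Writer priority: new readers also wait while a writer is queued.
        readers_cv_.wait(lk, [this] { return !writer_active_ && waiting_writers_ == 0; });
        ++active_readers_;
    }
    void unlock_shared()
    {
        std::unique_lock<std::mutex> lk(m_);
        if (--active_readers_ == 0)
            writers_cv_.notify_one(); // last reader out lets a writer in
    }
    void lock()
    {
        std::unique_lock<std::mutex> lk(m_);
        ++waiting_writers_;
        writers_cv_.wait(lk, [this] { return !writer_active_ && active_readers_ == 0; });
        --waiting_writers_;
        writer_active_ = true;
    }
    void unlock()
    {
        std::unique_lock<std::mutex> lk(m_);
        writer_active_ = false;
        if (waiting_writers_ > 0)
            writers_cv_.notify_one();
        else
            readers_cv_.notify_all();
    }
};

Even this short sketch has to be careful about exactly the points warned about above: readers and writers wait on separate condition variables, and the unlock path decides whom to wake.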
There is an excellent MSDN article, "Compound Win32 Synchronization Objects", that takes you through the creation of a reader/writer lock. It starts simple, then grows more complicated to handle all the corner cases. One thing that stood out: they would show a sample that looked perfectly good, then explain why it wouldn't actually work. Had they not pointed out the problems, you might never have noticed. Well worth a read.
Hope this is helpful.
This sounds like a rather difficult question for an interview; I would not "implement" a read/write mutex in the sense of writing one from scratch, since there are much better off-the-shelf solutions available. The sensible real-world thing would be to use an existing mutex type. Perhaps what they really wanted to know was how you would use such a type?
AFAIK you need either an atomic compare-and-swap instruction, or the ability to disable interrupts. See compare-and-swap on Wikipedia. At least, that's how an OS would implement it. If you have an operating system, stand on its shoulders and use an existing library (Boost, for example).
How do I design a multithreaded C program to avoid semaphore/mutex concurrency?
There are a few ways.
The best way is to eliminate all shared data between threads. While this isn't always practical, it's always good to eliminate as much shared data as possible.
After that, you need to start looking into lockless programming. Lockless programming is a bit of a fad right now, but the dirty secret is that it's often a much better idea to use lock-based concurrency like mutexes and semaphores. Lockless programming is very hard to get correct. Look up Herb Sutter's articles on the subject, or the Wikipedia page. There are a lot of good resources about lockless synchronization out there.
Somewhere in between are critical sections. If you're programming on Windows, critical sections should be preferred to mutexes, as they do some work to avoid the overhead of full mutex locks and unlocks. Try those out first, and if your performance is unacceptable (or you're targeting platforms without critical sections), then you can look into lockless techniques.
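For reference, the Win32 API in question looks like this; a minimal sketch (the function names around it are mine):

#include <windows.h>

CRITICAL_SECTION cs;

void init()
{
    // InitializeCriticalSectionAndSpinCount(&cs, 4000) would also let you
    // tune the user-mode spin before the kernel wait.
    InitializeCriticalSection(&cs);
}

void work()
{
    EnterCriticalSection(&cs);
    // ... short, rarely contended critical section ...
    LeaveCriticalSection(&cs);
}

void cleanup()
{
    DeleteCriticalSection(&cs);
}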
It would be best to write lock-free code. It's hard, but possible.
GCC Atomic Builtins
Article by Andrei Alexandrescu (C++)
Be sure to pass data structures between threads in such a way that you always know which thread the data exclusively belongs to. If you use (as Dan mentioned) e.g. lockless queues to pass your data around, you shouldn't run into too many concurrency issues (your code then behaves much like any other code waiting for some data to arrive).
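A minimal sketch of that ownership-passing style, using a locked queue rather than a lockless one (the principle is the same, and all names are mine): passing std::unique_ptr through the queue transfers ownership, so exactly one thread touches the payload at any time.

#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>

template <typename T>
class channel
{
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::unique_ptr<T>> q_;
public:
    void send(std::unique_ptr<T> item)
    {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }
    std::unique_ptr<T> receive() // blocks until an item arrives
    {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        auto item = std::move(q_.front());
        q_.pop();
        return item;
    }
};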
If you are, however, migrating single-threaded to multithreaded code, that is an entirely different beast. It's very hard, and most of the time there are no elegant solutions.
Also, have a look at the InterlockedXXX family of functions to perform atomic operations in a multithreaded environment:
InterlockedIncrement Function
InterlockedDecrement Function
InterlockedExchange Function
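As a small illustration of the Interlocked family (the counter and thread setup are my own assumptions): two threads increment a shared LONG without any mutex, and the result is still deterministic.

#include <windows.h>
#include <iostream>
#include <thread>

volatile LONG counter = 0; // shared between the two threads

int main()
{
    auto worker = [] {
        for (int i = 0; i < 100000; ++i)
            InterlockedIncrement(&counter); // one atomic read-modify-write
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    std::cout << counter << '\n'; // always 200000
}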