I'm studying for an exam, but the slides given by the professor aren't very helpful, and any searches I do give only muddy results. From what I could gather:
Semaphore: a counter with down() and up() operations; a call to down() when the counter is 0 busy-waits until the counter is > 0 again
Mutex: binary semaphore that can only be released by its owning process/thread
Test-and-set: CPU instruction for retrieving a binary value and setting it to 1 atomically; used to implement mutexes
Monitor: object that forces synchronized access, i.e. only one process/thread can access it at a time; can be implemented using mutexes
Message passing: processes send messages over some shared memory location to tell each other when the other can continue their work; this is effectively a semaphore that acts not only as a counter, but can also be used to exchange other data, e.g. an item in the producer-consumer problem
I have the following questions:
Are these definitions correct?
Is there a way to implement monitors without mutexes?
Is there a way to implement mutexes without test-and-set, specifically without busy-waiting?
Are there uses for test-and-set other than to implement mutexes?
No. You may or may not be able to implement them with busy waits, but that is by no means part of the definition. The definition is: Procure (down) will return when the semaphore is available; Liberate (up) will make the semaphore available. How they do that is up to them. There is no other definition.
Yes. Roughly speaking you can use any of { message passing, mutexes, semaphores } to implement { message passing, mutexes, semaphores }. By transitivity, anything you can implement with any of these, you can implement with any other of these. But remember, you can kick a ball down a beach. A ball is an object like a whale; thus you can kick a whale down a beach. It might be a bit harder.
Yes. Ask your favourite search-and-creep service about Dekker and his fabulous algorithm.
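For reference, here is a minimal sketch of Dekker's algorithm for two threads (written with C++ atomics so the compiler cannot reorder the accesses; the names are illustrative). Note that it avoids test-and-set, but it still busy-waits:

#include <atomic>

// Dekker's algorithm for exactly two threads, with ids 0 and 1.
// Only plain (sequentially consistent) loads and stores are used;
// no test-and-set instruction is required.
std::atomic<bool> wants_to_enter[2] = {{false}, {false}};
std::atomic<int> turn{0};

void dekker_lock(int me) {
    int other = 1 - me;
    wants_to_enter[me] = true;
    while (wants_to_enter[other]) {        // the other thread wants in too
        if (turn != me) {
            wants_to_enter[me] = false;    // back off while it is not our turn
            while (turn != me) { /* busy wait */ }
            wants_to_enter[me] = true;
        }
    }
}

void dekker_unlock(int me) {
    turn = 1 - me;                         // hand priority to the other thread
    wants_to_enter[me] = false;
}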
Yes. You might implement a multi-cpu state machine using test+set and test+clr, for example. Turing machines are fantastically flexible; simple abstractions like mutexes are meant to constrain that flexibility into something comprehensible.
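And to make the test-and-set point concrete, here is a minimal spinlock sketch built on std::atomic_flag, whose test_and_set() is the portable C++ form of that instruction (illustrative only, and it does busy-wait):

#include <atomic>

// Minimal spinlock: test_and_set() atomically reads the old value and sets
// the flag; we keep spinning while the old value was already "taken".
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy wait
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};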
I know that I can check the state of the std::future the following way:
my_future.wait_for(std::chrono::seconds(0)) == std::future_status::ready
But according to cppreference.com std::future::wait_for may block in some cases:
This function may block for longer than timeout_duration due to scheduling or resource contention delays.
Is it still the case when timeout_duration is 0 ? If so, is there another way to query the state in a guaranteed wait-free manner ?
The quote from cppreference is simply there to remind you that the OS scheduler is a factor here and that other tasks requiring platform resources could be using the CPU-time your thread needs in order to return from wait_for() -- regardless of the specified timeout duration being zero or not. That's all. You cannot technically be guaranteed to get more than that on a non-realtime platform. As such, the C++ Standard says nothing about this, but you can see other interesting stuff there -- see the paragraph for wait_for() under [futures.unique_future¶21]:
Effects: None if the shared state contains a deferred function
([futures.async]), otherwise blocks until the shared state is ready or
until the relative timeout ([thread.req.timing]) specified by
rel_time has expired.
No such mention of the additional delay here, but it does say that you are blocked, and it remains implementation dependent whether wait_for() is yield()ing the thread [1] first thing upon such blocking or immediately returning if the timeout duration is zero. In addition, it might also be necessary for an implementation to synchronize access to the future's status in a locking manner, which would have to be applied prior to checking if a potential immediate return is to take place. Hence, you don't even have the guarantee for lock-freedom here, let alone wait-freedom.
Note that the same applies for calling wait_until with a time in the past.
Is it still the case when timeout_duration is 0 ? If so, is there
another way to query the state in a guaranteed wait-free manner ?
So yes, implementation of wait_for() notwithstanding, this is still the case. As such, this is the closest to wait-free you're going to get for checking the state.
[1] In simple terms, this means "releasing" the CPU and putting your thread at the back of the scheduler's queue, giving other threads some CPU-time.
To answer your second question, there is currently no way to check if the future is ready other than waiting. We will likely get this at some point: https://en.cppreference.com/w/cpp/experimental/future/is_ready. If your runtime library supports the concurrency extensions and you don't mind using experimental in your code, then you can use is_ready() now. That being said, I know of few cases where you must check a future's state. Are you sure it's necessary?
Is it still the case when timeout_duration is 0 ?
Yes. That's true for any operation. The OS scheduler could pause the thread (or the whole process) to allow another thread to run on the same CPU.
If so, is there another way to query the state in a guaranteed wait-free manner ?
No. Using a zero timeout is the correct way.
There's not even a guarantee that the shared state of a std::future doesn't lock a mutex to check if it's ready, so it would be impossible to guarantee it was wait-free.
For GCC's implementation the ready flag is an atomic so there's no mutex lock needed, and if it's ready then wait_for returns immediately. If it's not ready then there are some more atomic operations and then a check to see if the timeout has passed already, then a system call. So for a zero timeout there are just some atomic loads and function calls (no system call).
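For completeness, the zero-timeout idiom from the question can be wrapped in a tiny helper. As the answers above explain, it is the practical way to poll a future even though no wait-free or lock-free guarantee exists; note that a deferred future reports future_status::deferred and is treated as "not ready" here:

#include <chrono>
#include <future>

// Polls a future's state using the zero-timeout idiom.
// Not guaranteed wait-free; deferred futures are reported as not ready.
template <typename T>
bool is_ready(const std::future<T>& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}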
(Note: Much of this is redundant with commentary on Massive CPU load using std::lock (c++11), but I think this topic deserves its own question and answers.)
I recently encountered some sample C++11 code that looked something like this:
std::unique_lock<std::mutex> lock1(from_acct.mutex, std::defer_lock);
std::unique_lock<std::mutex> lock2(to_acct.mutex, std::defer_lock);
std::lock(lock1, lock2); // avoid deadlock
transfer_money(from_acct, to_acct, amount);
Wow, I thought, std::lock sounds interesting. I wonder what the standard says it does?
C++11 section 30.4.3 [thread.lock.algorithm], paragraphs (4) and (5):
template <class L1, class L2, class... L3> void lock(L1&, L2&, L3&...);
4 Requires: Each template parameter type shall meet the Lockable
requirements, [ Note: The unique_lock class template meets these
requirements when suitably instantiated. — end note ]
5 Effects: All arguments are locked via a sequence of calls to lock(),
try_lock(), or unlock() on each argument. The sequence of calls shall
not result in deadlock, but is otherwise unspecified. [ Note: A
deadlock avoidance algorithm such as try-and-back-off must be used, but
the specific algorithm is not specified to avoid over-constraining
implementations. — end note ] If a call to lock() or try_lock() throws
an exception, unlock() shall be called for any argument that had been
locked by a call to lock() or try_lock().
Consider the following example. Call it "Example 1":
Thread 1 Thread 2
std::lock(lock1, lock2); std::lock(lock2, lock1);
Can this deadlock?
A plain reading of the standard says "no". Great! Maybe the compiler can order my locks for me, which would be kind of neat.
Now try Example 2:
Thread 1 Thread 2
std::lock(lock1, lock2, lock3, lock4); std::lock(lock3, lock4);
std::lock(lock1, lock2);
Can this deadlock?
Here again, a plain reading of the standard says "no". Uh oh. The only way to do that is with some kind of back-off-and-retry loop. More on that below.
Finally, Example 3:
Thread 1 Thread 2
std::lock(lock1,lock2); std::lock(lock3,lock4);
std::lock(lock3,lock4); std::lock(lock1,lock2);
Can this deadlock?
Once again, a plain reading of the standard says "no". (If the "sequence of calls to lock()" in one of these invocations is not "resulting in deadlock", what is, exactly?) However, I am pretty sure this is unimplementable, so I suppose it's not what they meant.
This appears to be one of the worst things I have ever seen in a C++ standard. I am guessing it started out as an interesting idea: Let the compiler assign a lock ordering. But once the committee chewed it up, the result is either unimplementable or requires a retry loop. And yes, that is a bad idea.
You can argue that "back off and retry" is sometimes useful. That is true, but only when you do not know which locks you are trying to grab up front. For example, if the identity of the second lock depends on data protected by the first (say because you are traversing some hierarchy), then you might have to do some grab-release-grab spinning. But in that case you cannot use this gadget, because you do not know all of the locks up front. On the other hand, if you do know which locks you want up front, then you (almost) always want simply to impose an ordering, not to loop.
Also, note that Example 1 can live-lock if the implementation simply grabs the locks in order, backs off, and retries.
In short, this gadget strikes me as useless at best. Just a bad idea all around.
OK, questions. (1) Are any of my claims or interpretations wrong? (2) If not, what the heck were they thinking? (3) Should we all agree that "best practice" is to avoid std::lock completely?
[Update]
Some answers say I am misinterpreting the standard, then go on to interpret it the same way I did, then confuse the specification with the implementation.
So, just to be clear:
In my reading of the standard, Example 1 and Example 2 cannot deadlock. Example 3 can, but only because avoiding deadlock in that case is unimplementable.
The entire point of my question is that avoiding deadlock for Example 2 requires a back-off-and-retry loop, and such loops are extremely poor practice. (Yes, some sort of static analysis on this trivial example could make that avoidable, but not in the general case.) Also note that GCC implements this thing as a busy loop.
[Update 2]
I think a lot of the disconnect here is a basic difference in philosophy.
There are two approaches to writing software, especially multi-threaded software.
In one approach, you throw a bunch of stuff together and run it to see how well it works. You are never convinced that your code has a problem unless someone can demonstrate that problem on a real system, right now, today.
In the other approach, you write code that can be rigorously analyzed to prove that it has no data races, that all of its loops terminate with probability 1, and so forth. You perform this analysis strictly within the machine model guaranteed by the language spec, not on any particular implementation.
Advocates of the latter approach are not impressed by any demonstrations on particular CPUs, compilers, compiler minor versions, operating systems, runtimes, etc. Such demonstrations are barely interesting and totally irrelevant. If your algorithm has a data race, it is broken, no matter what happens when you run it. If your algorithm has a livelock, it is broken, no matter what happens when you run it. And so forth.
In my world, the second approach is called "Engineering". I am not sure what the first approach is called.
As far as I can tell, the std::lock interface is useless for Engineering. I would love to be proven wrong.
I think you are misunderstanding the scope of the deadlock avoidance. That's understandable since the text seems to mention lock in two different contexts, the "multi-lock" std::lock and the individual locks carried out by that "multi-lock" (however the lockables implement it). The text for std::lock states:
All arguments are locked via a sequence of calls to lock(), try_lock(), or unlock() on each argument. The sequence of calls shall not result in deadlock
If you call std::lock passing ten different lockables, the standard guarantees no deadlock for that call. It's not guaranteed that deadlock is avoided if you lock the lockables outside the control of std::lock. That means thread 1 locking A then B can deadlock against thread 2 locking B then A. That was the case in your original third example, which had (pseudo-code):
Thread 1 Thread 2
lock A lock B
lock B lock A
As that couldn't have been std::lock (each call locked only one resource), it must have been something like unique_lock.
The deadlock avoidance will occur if both threads attempt to lock A/B and B/A in a single call to std::lock, as per your first example. Your second example won't deadlock either since thread 1 will be backing off if the second lock is needed by a thread 2 already having the first lock. Your updated third example:
Thread 1 Thread 2
std::lock(lock1,lock2); std::lock(lock3,lock4);
std::lock(lock3,lock4); std::lock(lock1,lock2);
still has the possibility of deadlock since the atomicity of the lock is a single call to std::lock. For example, if thread 1 successfully locks lock1 and lock2, then thread 2 successfully locks lock3 and lock4, deadlock will ensue as both threads attempt to lock the resource held by the other.
So, in answer to your specific questions:
1/ Yes, I think you've misunderstood what the standard is saying. The sequence it talks about is clearly the sequence of locks carried out on the individual lockables passed to a single std::lock.
2/ As to what they were thinking, it's sometimes hard to tell :-) But I would posit that they wanted to give us capabilities that we would otherwise have to write ourselves. Yes, back-off-and-retry may not be an ideal strategy but, if you need the deadlock avoidance functionality, you may have to pay the price. Better for the implementation to provide it rather than it having to be written over and over again by developers.
3/ No, there's no need to avoid it. I don't think I've ever found myself in a situation where simple manual ordering of locks wasn't possible but I don't discount the possibility. If you do find yourself in that situation, this can assist (so you don't have to code up your own deadlock avoidance stuff).
In regard to the comments that back-off-and-retry is a problematic strategy, yes, that's correct. But you may be missing the point that it may be necessary if, for example, you cannot enforce the ordering of the locks before-hand.
And it doesn't have to be as bad as you think. Because the locks can be done in any order by std::lock, there's nothing stopping the implementation from re-ordering after each backoff to bring the "failing" lockable to the front of the list. That would mean those that were locked would tend to gather at the front, so that the std::lock would be less likely to be claiming resources unnecessarily.
Consider the call std::lock (a, b, c, d, e, f) in which f was the only lockable that was already locked. In the first lock attempt, that call would lock a through e then "fail" on f.
Following the back-off (unlocking a through e), the list to lock would be changed to f, a, b, c, d, e so that subsequent iterations would be less likely to unnecessarily lock. That's not fool-proof since other resources may be locked or unlocked between iterations, but it tends towards success.
In fact, it may even order the list initially by checking the states of all lockables so that all those currently locked are at the front. That would start the "tending toward success" operation earlier in the process.
That's just one strategy; there may well be others, even better ones. That's why the standard didn't mandate how it was to be done, on the off-chance there may be some genius out there who comes up with a better way.
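To make the try-and-back-off idea (and the re-ordering refinement) concrete, here is a minimal sketch for two mutexes. It is illustrative only, not the actual library implementation:

#include <mutex>
#include <thread>
#include <utility>

// Block on one mutex, try_lock the other; on failure release everything,
// yield, and retry starting from the mutex that was busy last time.
void lock_both(std::mutex& a, std::mutex& b) {
    std::mutex* first = &a;
    std::mutex* second = &b;
    for (;;) {
        first->lock();
        if (second->try_lock())
            return;                    // both held, done
        first->unlock();               // back off completely
        std::swap(first, second);      // next round: block on the busy mutex
        std::this_thread::yield();     // reduce live-lock pressure
    }
}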
Perhaps it would help if you thought of each individual call to std::lock(x, y, ...) as atomic. It will block until it can lock all of its arguments. If you don't know all of the mutexes you need to lock a-priori, do not use this function. If you do know, then you can safely use this function, without having to order your locks.
But by all means order your locks if that is what you prefer to do.
Thread 1 Thread 2
std::lock(lock1, lock2); std::lock(lock2, lock1);
The above will not deadlock. One of the threads will get both locks, and the other thread will block until the first one has released the locks.
Thread 1 Thread 2
std::lock(lock1, lock2, lock3, lock4); std::lock(lock3, lock4);
std::lock(lock1, lock2);
The above will not deadlock. Though this is tricky. If Thread 2 gets lock3 and lock4 before Thread 1 does, then Thread 1 will block until Thread 2 releases all 4 locks. If Thread 1 gets the four locks first, then Thread 2 will block at the point of locking lock3 and lock4 until Thread 1 releases all 4 locks.
Thread 1 Thread 2
std::lock(lock1,lock2); std::lock(lock3,lock4);
std::lock(lock3,lock4); std::lock(lock1,lock2);
Yes, the above can deadlock. You can view the above as exactly equivalent to:
Thread 1 Thread 2
lock12.lock(); lock34.lock();
lock34.lock(); lock12.lock();
Update
I believe part of the misunderstanding is treating dead-lock and live-lock as if they were both correctness issues.
In actual practice, dead-lock is a correctness issue, as it causes the process to freeze. And live-lock is a performance issue, as it causes the process to slow down, but it still completes its task correctly. The reason is that live-lock will not (in practice) sustain itself indefinitely.
<disclaimer>
There are forms of live-lock that can be created which are permanent, and thus equivalent to dead-lock. This answer does not address such code, and such code is not relevant to this issue.
</disclaimer>
The yield shown in this answer is a significant performance optimization which significantly decreases live-lock, and thus significantly increases the performance of std::lock(x, y, ...).
Update 2
After a long delay, I have written a first draft of a paper on this subject. The paper compares 4 different ways of getting this job done. It contains software you can copy and paste into your own code and test yourself:
http://howardhinnant.github.io/dining_philosophers.html
Your confusion with the standardese seems to be due to this statement
5 Effects: All arguments are locked via a sequence of calls to lock(),
try_lock(), or unlock() on each argument.
That does not imply that std::lock will recursively call itself with each argument to the original call.
Objects that satisfy the Lockable concept (§30.2.5.4 [thread.req.lockable.req]) must implement all 3 of those member functions. std::lock will invoke these member functions on each argument, in an unspecified order, to attempt to acquire a lock on all objects, while doing something implementation defined to avoid deadlock.
Your example 3 has a potential for deadlock because you're not issuing a single call to std::lock with all objects that you want to acquire a lock on.
Example 2 will not cause a deadlock; Howard's answer explains why.
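In other words, any type that provides those three member functions can be passed to std::lock. A minimal, hypothetical example:

#include <mutex>

// A minimal Lockable type: std::lock only requires lock(), try_lock()
// and unlock(), and may call them on each argument in an unspecified order.
class MyLockable {
    std::mutex m_;
public:
    void lock()     { m_.lock(); }
    bool try_lock() { return m_.try_lock(); }
    void unlock()   { m_.unlock(); }
};

// Usage: MyLockable a, b;  std::lock(a, b);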
Did C++11 adopt this function from Boost?
If so, Boost's description is instructive (emphasis mine):
Effects: Locks the Lockable objects supplied as arguments in an unspecified and
indeterminate order in a way that avoids deadlock. It is safe to call
this function concurrently from multiple threads with the same mutexes
(or other lockable objects) in different orders without risk of
deadlock. If any of the lock() or try_lock() operations on the
supplied Lockable objects throws an exception any locks acquired by
the function will be released before the function exits.
If I lock a std::mutex, will I always get a memory fence? I am unsure whether locking implies or enforces a fence.
Update:
Found this reference following up on RMF's comments.
Multithreaded programming and memory visibility
As I understand it, this is covered in:
1.10 Multi-threaded executions and data races
Para 5:
The library defines a number of atomic operations (Clause 29) and operations on mutexes (Clause 30)
that are specially identified as synchronization operations. These operations play a special role in making assignments in one thread visible to another. A synchronization operation on one or more memory locations is either a consume operation, an acquire operation, a release operation, or both an acquire and release operation. A synchronization operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence. In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics. [Note: For example, a call that acquires a mutex will perform an acquire operation on the locations comprising the mutex. Correspondingly, a call that releases the same mutex will perform a release operation on those same locations. Informally, performing a release operation on A forces prior side effects on other memory locations to become visible to other threads that later perform a consume or an acquire operation on A. “Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races. —end note]
Unlocking a mutex synchronizes with locking the mutex. I don't know what options the compiler has for the implementation, but you get the same effect as a fence.
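A small sketch of what that buys you in practice, assuming both threads use the same mutex:

#include <iostream>
#include <mutex>

std::mutex m;
int data = 0;   // protected by m

void writer() {
    std::lock_guard<std::mutex> g(m);
    data = 42;                          // written before the unlock (release)
}

void reader() {
    std::lock_guard<std::mutex> g(m);   // lock (acquire) pairs with the unlock
    std::cout << data << '\n';          // prints 42 if writer ran first
}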
A mutex operation (lock or unlock) on a particular mutex M is only useful for any purpose related to synchronization or memory visibility if M is shared by different threads and they perform these operations. A mutex defined locally and only used by one thread does not provide any meaningful synchronization.
[Note: The optimizations I describe here are probably not done by many compilers, which might view these mutex and atomic synchronization operations as "black boxes" that cannot be optimized (or even that should not be optimized, in order to preserve the predictability of code generation and some particular patterns, which is a bogus argument). I would not be surprised if no compiler did the optimization even in the simpler cases, but there is no doubt that they are legal.]
A compiler can easily determine that some variables are never used by multiple threads (or any asynchronous execution), notably for an automatic variable whose address is not taken (and to which no reference is created). Such objects are called here "thread private". (All automatic variables that are candidates for register allocation are thread private.)
For a thread-private mutex, no meaningful code needs to be generated for lock/unlock operations: no atomic compare-and-exchange, no fencing, and often no state needs to be kept at all, except for the case of a "safe mutex" where the behavior of recursive locking is well defined and should fail (to make the sequence m.lock(); bool locked = m.try_lock(); work you need to keep at least a boolean state).
This is also true for any thread-private atomic objects: only the naked non-atomic type is needed and normal operations can be performed (so a fetch-add of 1 becomes a regular post-increment).
The reasons why these transformations are legal:
the obvious observation that if an object is accessed by only one thread or parallel execution (and isn't even accessed by an asynchronous signal handler), then the use of non-atomic operations in assembly is not detectable by any means
the less obvious remark that no ordering/memory visibility is implied by any use of a thread-private synchronization object.
All synchronization objects are specified as tools for inter thread communication: they can guarantee that the side effects in one thread are visible in another thread; they cause a well defined order of operations to exist not just in one thread (the sequential order of execution of operations of one thread) but in multiple threads.
A common example is the publication of information via an atomic pointer type:
The shared data is:
atomic<T*> shared{nullptr}; // starts as a null pointer
The publishing thread does:
T *p = new T;
*p = load_info();
shared.store(p, memory_order_release);
The consuming thread can check whether the data is available by loading the atomic object's value:
T *p = shared.load(memory_order_acquire);
if (p) use *p;
(There is no defined way to wait for availability here, it's a simple example to illustrate publication and consumption of the published value.)
The publishing thread only needs to set the atomic variable after it has finished the initialization of all fields; the memory order is a release to communicate the fact that the memory manipulations are finished.
The other threads only need an acquire memory order to "connect" with the release operation, if there was one. If the value is still null, the thread has learned nothing about the world and the acquire is meaningless; it can't act on it. (By the time the thread checks the pointer and sees a null, the shared variable might have changed already. It doesn't matter, as the designer considered that not having a value in that thread is manageable, or it would have performed the operations in a sequence.)
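Putting those fragments together, a self-contained sketch of the pattern might look like this (T and load_info() are placeholders for the real payload; cleanup is omitted):

#include <atomic>
#include <iostream>

struct T { int value; };

std::atomic<T*> shared{nullptr};                    // null until published

T load_info() { return T{42}; }                     // placeholder

void publisher() {
    T* p = new T;
    *p = load_info();                               // finish all writes first
    shared.store(p, std::memory_order_release);     // then publish
}

void consumer() {
    T* p = shared.load(std::memory_order_acquire);  // pairs with the release
    if (p)
        std::cout << p->value << '\n';              // all of *p is visible here
}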
All atomic operations are intended to be potentially lock-free, that is, to finish in a short finite time whatever other threads are doing, even if they are stuck. It means that you can't depend on another thread having finished a job.
At the other end of the spectrum of thread communication primitives, mutexes don't hold a value that can be used to carry information between threads (*), but they ensure that one thread can enter a lock-ops-unlock sequence only after another thread has finished its own lock-ops-unlock sequence.
(*) not even a boolean value, as using a mutex as a general boolean signal (= binary semaphore) between threads is specifically prohibited
A mutex is always used in connection with a set of shared variables: the protected variables or objects V; these V are used to carry information between threads, and the mutex makes access to that information well ordered (or serialized) between threads. In technical terms, every mutex lock operation on M except the first pairs with the previous unlock operation on M:
a lock of M is an acquire operation on M
an unlock of M is a release operation on M
The semantics of locking/unlocking are defined on a single M, so let's stop repeating "on M"; we have threads A and B. The lock by B is an acquire that pairs with the unlock by A. Both operations together form an inter-thread synchronization.
[What about a thread that happens to lock M often and will often re-lock M without any other thread acting on M in the meantime? Nothing interesting: the acquire is still paired with a release, but A = B, so nothing is accomplished. The unlock was sequenced in the same thread of execution, so it's pointless in that particular case, but in general a thread can't tell that it's pointless. It isn't even special-cased by the language semantics.]
The synchronization that occurs is between the set of threads T acting on a mutex: no other thread is guaranteed to be able to view any memory operation performed by these T. Note that in practice, on most real computers, once a memory modification hits the cache, all CPUs will see it if they check that same address, thanks to cache coherence. But C/C++ threads(#) are not specified in terms of a globally consistent cache, nor in terms of the ordering visible on a CPU, because the compiler itself can assume that non-atomic objects are not mutated in arbitrary ways by the program without synchronization (the CPU cannot assume any such thing, as it has no concept of atomic vs. non-atomic memory locations). That means that a guarantee available on the CPU/memory system you are targeting is not in general available at the C/C++ high-level model. You absolutely cannot use normal C/C++ code as high-level assembly; only by dousing your code with volatile (almost everywhere) can you even vaguely approach high-level assembly (but not quite).
(#) "C/C++ thread/semantics" not "C/C++ programming language thread semantics": C and C++ are based on the same specification for synchronization primitives, that doesn't mean that there is a C/C++ language)
Since the effect of mutex operations on M is only to serialize access to some data by threads that use M, it's clear that other threads don't see any effect. In technical terms, the synchronizes-with relation is between threads using the same synchronization objects (mutexes in that context, the same atomic objects in the context of atomic use).
Even when the compiler emits memory fences in assembly language, it doesn't have to assume that an unlock operation makes changes performed before the unlock visible to threads outside the set T.
That allows the decomposition of sets of threads for program analysis: if a program runs two sets of threads U and V in parallel, and U and V are created such that they can't access any common synchronization object (but they can access common non-atomic objects), then you can analyse the interactions of U and of V separately from the point of view of thread semantics, as U and V cannot exchange information in well-defined inter-thread ways (they can still exchange information via the system, for example via disk files, sockets, or system-specific shared memory).
(That observation might allow the compiler to optimize some threads without doing a full program analysis, even if some common mutable objects are "pulled in" via a third-party class that has static members.)
Another way to explain that is to say that the semantics of these primitives is not leaking: only those threads that participate get a defined result.
Note that this is only true at the specification level for acquire and release operations, not for sequentially consistent operations (which is the default order for operations on atomic objects when you don't specify a memory order): all sequentially consistent actions (operations on an atomic object, or fences) occur in a well-defined global order. That, however, doesn't mean anything for independent threads having no atomic objects in common.
An order of operations is unlike an order of elements in a container, where you can really navigate the container, or like files being presented as ordered by name. Only objects are observable; the orders of operations are not. Saying that there is a well-defined order only means that values don't appear to have provably changed backward (vs. some abstract order).
If you have two unrelated sets that are ordered (say, the integers with the usual order and words with lexicographical order), you can define the sum of these sets as having an order compatible with both orders. You might put the numbers before, after, or alternated with the words. You are free to do what you want because the elements in the sum of the two sets don't have any relation with each other when they don't come from the same set.
You could say that there is a global order of all mutex operations, it's just not useful, like defining the order of the sum of unrelated sets.
There is a widely known way of locking multiple locks, which relies on choosing a fixed linear ordering and acquiring locks according to this ordering.
That was proposed, for example, in the answer for "Acquire a lock on two mutexes and avoid deadlock". Especially, the solution based on address comparison seems to be quite elegant and obvious.
When I tried to check how it is actually implemented, I found, to my surprise, that this solution is not widely used.
To quote the Kernel Docs - Unreliable Guide To Locking:
Textbooks will tell you that if you always lock in the same order, you
will never get this kind of deadlock. Practice will tell you that this
approach doesn't scale: when I create a new lock, I don't understand
enough of the kernel to figure out where in the 5000 lock hierarchy it
will fit.
PThreads doesn't seem to have such a mechanism built in at all.
Boost.Thread came up with a completely different solution: lock() for multiple (2 to 5) mutexes is based on trying to lock as many mutexes as possible at the moment.
This is a fragment of the Boost.Thread source code (Boost 1.48.0, boost/thread/locks.hpp:1291):
template<typename MutexType1,typename MutexType2,typename MutexType3>
void lock(MutexType1& m1,MutexType2& m2,MutexType3& m3)
{
    unsigned const lock_count=3;
    unsigned lock_first=0;
    for(;;)
    {
        switch(lock_first)
        {
        case 0:
            lock_first=detail::lock_helper(m1,m2,m3);
            if(!lock_first)
                return;
            break;
        case 1:
            lock_first=detail::lock_helper(m2,m3,m1);
            if(!lock_first)
                return;
            lock_first=(lock_first+1)%lock_count;
            break;
        case 2:
            lock_first=detail::lock_helper(m3,m1,m2);
            if(!lock_first)
                return;
            lock_first=(lock_first+2)%lock_count;
            break;
        }
    }
}
where lock_helper returns 0 on success and number of mutexes that weren't successfully locked otherwise.
Why is this solution better than comparing addresses or any other kind of IDs? I don't see any problems with pointer comparison that would be avoided by this kind of "blind" locking.
Are there any other ideas on how to solve this problem on a library level?
From the bounty text:
I'm not even sure if I can prove correctness of the presented Boost solution, which seems more tricky than the one with linear order.
The Boost solution cannot deadlock because it never waits while already holding a lock. All locks but the first are acquired with try_lock. If any try_lock call fails to acquire its lock, all previously acquired locks are freed. Also, in the Boost implementation the new attempt will start from the lock it failed to acquire the previous time, and will first wait till it is available; it's a smart design decision.
As a general rule, it's always better to avoid blocking calls while holding a lock, so the try-lock solution, where possible, is preferred (in my opinion). As a particular consequence, with lock ordering the system as a whole might get stuck. Imagine the very last lock (e.g. the one with the biggest address) was acquired by a thread which was then blocked. Now imagine some other thread needs the last lock and another lock; due to the ordering it will first get the other one and then wait on the last lock. The same can happen with all the other locks, and the whole system makes no progress until the last lock is released. Of course it's an extreme and rather unlikely case, but it illustrates the inherent problem with lock ordering: the later a lock comes in the order, the more indirect impact acquiring it has.
The shortcoming of the try-lock-based solution is that it can cause livelock, and in extreme cases the whole system might also get stuck for at least some time. Therefore it is important to have some back-off scheme that makes the pauses between locking attempts longer over time, and is perhaps randomized.
Sometimes, lock A needs to be acquired before lock B. Lock B might have either a lower or a higher address, so you can't use address comparison in this case.
Example: When you have a tree data structure, and threads try to read and update nodes, you can protect the tree using a reader-writer lock per node. This only works if your threads always acquire locks top-down, root-to-leaf. The addresses of the locks do not matter in this case.
You can only use address comparison if it does not matter at all which lock gets acquired first. If this is the case, address comparison is a good solution. But if this is not the case you can't do it.
I guess the Linux kernel requires certain subsystems to be locked before others are. This cannot be done using address comparison.
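For the "sibling" case where address comparison does apply, the idea is simply that every thread agrees on the same order. A sketch (using std::less, which gives a total order over pointers; as later answers note, this breaks down with proxy objects or inter-process mutexes):

#include <functional>
#include <mutex>

// Lock two "sibling" mutexes in a globally consistent (address) order.
void lock_ordered(std::mutex& a, std::mutex& b) {
    if (&a == &b) {                        // same mutex passed twice
        a.lock();
        return;
    }
    if (std::less<std::mutex*>()(&a, &b)) {
        a.lock();
        b.lock();
    } else {
        b.lock();
        a.lock();
    }
}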
The "address comparison" and similar approaches, although used quite often, are special cases. They works fine if you have
a lock-free mechanism to get two (or more) "items" of the same kind or hierarchy level, and
any stable ordering schema between those items
For example: You have a mechanism to get two "accounts" from a list. Assume that the access to the list is lock-free. Now you have pointers to both items and want to lock them. Since they are "siblings" you have to choose which one to lock first. Here the approach using addresses (or any other stable ordering schema like "account id") is OK.
But the linked Linux text talks about "lock hierarchies". This means locking not between "siblings" (of the same kind) but between "parent" and "children", which might be of different types. This may happen in actual tree structures as well as in other scenarios.
Contrived example: To load a program you must
lock the file inode,
lock the process table
lock the destination memory
These three locks are not "siblings" and are not in a clear hierarchy. The locks are also not taken directly one after the other; each subsystem will take its locks at will. If you consider all use cases where those three (and more) subsystems interact, you see that there is no clear, stable ordering you can think of.
The Boost library is in the same situation: it strives to provide generic solutions, so it cannot assume the points above and must fall back to a more complicated strategy.
One scenario where address comparison will fail is when you use the proxy pattern.
You can delegate the locks to the same underlying object while the wrapper objects' addresses differ.
Consider the following example:
#include <iostream>

// Illustrative proxy that forwards lock/unlock to a shared mutex.
template<typename MutexType>
class MutexHelper
{
public:
    MutexHelper(MutexType &m) : _m(m) {}
    void lock()
    {
        std::cout << "locking ";
        _m.lock();
    }
    void unlock()
    {
        std::cout << "unlocking ";
        _m.unlock();
    }
private:
    MutexType &_m;
};
If the function
template<typename MutexType1,typename MutexType2,typename MutexType3>
void lock(MutexType1& m1,MutexType2& m2,MutexType3& m3);
actually used address comparison, the following code could produce a deadlock:
Mutex m1;
Mutex m2;
thread 1:
MutexHelper hm1(m1);
MutexHelper hm2(m2);
lock(hm1, hm2);
thread 2:
MutexHelper hm2(m2);
MutexHelper hm1(m1);
lock(hm1, hm2);
EDIT:
This is an interesting thread that sheds some light on the boost::lock implementation:
thread-best-practice-to-lock-multiple-mutexes
Address compare does not work for inter-process shared mutexes (named synchronization objects).