I am trying to figure out the relationship between shared memory based concurrency algorithms (Peterson's / Bakery) and the use of semaphores and mutexes.
In the first case, we have a system without OS intervention, and processes can synchronize themselves using shared memory and busy waiting.
In the second case, the OS provides processes/threads with the ability to block, and not have to busy wait.
Is there ever a situation where we'd like to use shared memory in addition to semaphores (to ensure fairness / lack of starvation), or does the OS offer a better way to do this?
(I am wondering about the general concepts, but answers specific to POSIX/Win32/JAVA threads are also interesting).
Thanks a lot!
I can't think of any circumstances where what you actually want is a busy wait. Busy waiting just consumes processor time without achieving anything. That's not to say that "busy wait" algorithms aren't useful (they are), but the "busy wait" part is not the desired property, it is just a necessary consequence of a property that is desired.
Peterson's lock algorithm and Lamport's bakery algorithm are fundamentally just implementations of the mutex concept. OS facilities provide implementations of the same concept, but with different trade-offs.
The "ideal" implementation of a mutex would have "zero overhead" --- acquiring a lock on a mutex would not take any time at all if it was not currently owned, a waiting thread would wake the instant that the prior owner released the lock, and in the mean time, the waiting thread would not consume any processor time.
A "busy wait" or "spin lock" algorithm trades processor time used by the waiting thread for a reduced wake-up time. Provided the thread is currently scheduled on a processor, a busy-waiter will wake as fast as the processor can transfer the necessary data for acquiring the lock and synchronizing the threads, but whilst it is waiting it will consume its maximum allotment of processor time. If the number of threads exceeds the number of available processors, this may well take time from the thread that currently owns the mutex, thus making the wait longer. However, in some cases the low latency between unlocking and locking is worth the trade-off.
On the other hand, a "blocking" mutex that uses OS facilities to put a waiting thread to sleep has a different trade-off. In this case, the time between unlocking a mutex and a waiting thread acquiring it can be quite large, possibly several hundred times larger than with a busy-wait algorithm. The benefit is that the waiting thread really does consume no processor time whilst waiting, so the OS can schedule other work whilst the thread is waiting. This can thus potentially reduce the overall wait time, and increase the overall throughput of the system.
Some mutex implementations use a combination of busy-waiting and blocking: they busy-wait for a short time, and then switch to blocking if the lock cannot be acquired in the short time. This has the benefits of the fast wake if the lock is released shortly after the thread began waiting, whilst consuming no processor time if the thread has to wait a long time. It also has the downsides of high processor usage for short waits, and slow wake-ups for long waits.
Related
I'm curious about the behavior of cpu during a thread waiting for a mutex. Now I can imagine two possibilities:
The cpu stay on the current thread and check if the mutex had been unlocked continually.
The cpu will switch to another thread(or process) for a moment and switch back to the origin thread and check temporary.
Which one is right or the stl implement in another way?
To understand this you first need to understand the difference between thread and cpu core. Thread is an abstract thing, a data structure, that is used to represent some sequence of operations to be executed. The OS assigns threads to cpu cores, and those cores then execute those operations. The OS (and also hardware) can also interrupt this execution at any time (although not in the middle of a single instruction), save such thread's state, suspend it, and assign some other thread to that core. This is also known as context switch. The OS sometimes does that on so called syscalls (when a program calls some OS's functionality, e.g. asks for the access to disk, network, etc.) as well. It is important because mutexes utilize some syscalls under the hood.
So what happens when a thread tries to access a locked mutex? First of all, no periodical checks happen. While possible, that would be a waste of cpu cycles and extremely unlikely that any serious OS does that. What actually happens is that each mutex internally has a queue associated. When it is locked, the OS will add current thread to this queue and will suspend it. Afterwards the OS will assign some other thread to this cpu core, if available.
Now if a mutex is locked, then there's a thread that actually locked that mutex. Let's call that thread an owner. This thread is not suspended, and it does some work. When it finishes whatever it is doing, it has to unlock the mutex (which is a syscall as well), otherwise those pending threads will never resume. When that (i.e. the unlocking) happens the OS will look at the associated queue, and pick a thread from it (which one is an implementation detail, it will often be some priority queue). This newly picked thread will be the new owner of the mutex, and the OS will resume it, meaning schedule the thread for execution. Schedule, because all cores may be busy at the moment.
Note that this is a brief overview of the topic. There are lots of other things and optimizations in play, like futexes and how to actually implement thread-safe (or rather core-safe) code without mutexes (these are not hardware features, mutexes are implemented in the OS). But that's more or less how things are.
Typically the thread will attempt to acquire the mutex, and if it can't (e.g. because another thread already acquired it) it will inform the scheduler and the scheduler will block the waiting thread and switch to a different thread, and then (later, when the lock is released) the scheduler will unblock the waiting thread and give it CPU time again.
On single-CPU systems; this is almost required. All CPU time spent (e.g. "spinning"/polling the lock again) between finding out the lock can't be acquired and doing a task switch (to a thread that may release the lock) is a waste of CPU time that will achieve nothing (because no other thread can release the lock until a task switch occurs).
However, research on multi-CPU systems (that I vaguely remember from about 20 years ago that may or may not have been done by Sun for Solaris) indicates that a small amount of "spinning" (in the hope that a thread running on a different CPU releases the lock in time) can be beneficial (by avoiding the cost of task switch/es). My intuition is that "time spent spinning before blocking" should be roughly equal to the cost of a task switch (or, if a task switch costs 123 microseconds, it'd probably be worthwhile spinning for 123 microseconds before the scheduler is told to block your thread); but this would depend heavily on scenario (e.g. how heavily contended the lock is, etc).
Typically,
The hardware thread (your "CPU") will be switched to running a different software thread by the kernel, and the original software thread will be set aside until the mutex it is waiting on becomes signaled. At that point the kernel will place it among the set of software threads that it seeks to schedule for execution on one of the hardware threads in the system.
Your option 1 applies to what is called a critical section on Microsoft's platforms and more generally a spinlock. See pthread_spin_lock().
Your option 2 is most similar to what usually happens.
In the Microsoft world, the Mutex is waited on with WaitForSingleObject(), which is described as
If the object's state is nonsignaled, the calling thread enters the wait state until the object is signaled or the time-out interval elapses.
Now you need to know that the "wait state" is a state where the thread is not active. We call it "blocking", which is the opposite of a busy wait where CPU time is used.
At that beginning, the kernel will immediately give the CPU to another thread and never give it back to your thread, unless the Mutex is becoming "signaled". So it will really use 0 CPU cycles during the wait.
When the kernel notices that the Mutex has changed, it can "wake up" the thread and might even boost its priority because it was waiting friendly all the time.
The cpu stay on the current thread and check if the mutex had been unlocked continually.
It's not the CPU that picks a thread to be executed. The thread scheduler of Windows will pick a thread that gets executed.
If a Mutex could block a CPU that way, you need to only 8 or 12 Mutexes to fully brick your system.
The cpu will switch to another thread(or process) for a moment [...]
Almost. There will be an interrupt by a timer. The interrupt will be handled by an interrupt service routine by the Windows kernel. At that time, the kernel can decide which thread will be executed next.
[...] and switch back to the origin thread and check temporary.
No. Because the Mutex is a kernel object, the kernel already knows that there's no used in letting the thread check again unless the Mutex has been signaled.
I have a program where sometimes bursts happen so that threads would load the CPU above 100% if that was possible, but in reality, they fight for the CPU. It is critical that a thread obtaining ownership of a synchronization primitive gets a higher priority than the other threads of the application, so to prevent the case where a thread obtains ownership and gets paused by the scheduler. Is there a suitable synchronization primitive in C++ (up to the latest draft) or WinAPI, or do I have to wrap the mutex locking code in SetThreadPriority() calls?
This isn't actually a problem. If a thread that owns a synchronization primitive gets paused by the scheduler, it would only be because there were enough ready-to-run threads to keep all the cores busy. In that case, there's no particular reason to care which thread runs.
Threads that waiting for the synchronization primitive aren't ready to run. So if you have four cores and the thread that holds the synchronization primitive isn't being blocked, it would only be because there are four threads, all ready-to-run, that can make forward progress without holding the synchronization primitive. In that case, running those four threads is just as good as running the thread that holds the synchronization primitive.
I strongly urge you not to mess with thread priorities unless you really have no choice. Once you start messing with thread priorities, the argument above can stop holding because you can get issues like priority inversion. But if you don't mess with thread priorities, then you can't run into those kinds of issues and the scheduler will be smart enough to do the right thing 99% of the time. And trying to mess with priorities to get it do the right thing that last 1% of the time will likely backfire.
The mechanism you are looking for is called a priority inheritance protocol. Pthreads offers support for this sort of configuration, and the idea is that if a high priority task is waiting for a resource held by a low priority task, the low priority task is boosted to that high priority until it relinquishes the resource.
Search for Liu and Layland, they wrote most of this up in the early 70s. As for C++, I am afraid it is a few versions away from 1973's state of the art.
Say I have a mutex and thread 1 locked the mutex. Now, thread 2 tries to acquire the lock but it is blocked, say for a couple of seconds. How expensive is this blocked thread? Can the executing hardware thread be rescheduled to do something computationally more expensive? If yes, then who checks if the mutex gets unlocked?
EDIT: Ok so I try to reformulate what I wanted to ask.
What I dont really understand is how the following works. thread 2 got blocked, so what exactly does thread 2 do? From the answer it seems like it is not just constantly checking if the mutex gets unlocked. If this were the case, I would consider a blocked thread expensive, as I am using one of my hardware threads just for checking if some boolean value changes.
So am I correct in thinking that when the mutex gets released by thread 1, thread 1 notifies the sheduler and the shedular assigns a hardware thread to execute thread 2 which is waiting?
I am reading your questions as:
How expensive is a locked mutex?
Mutex can be considered as an integer in memory.
A thread trying to lock on a mutex has to read the existing state of the mutex and can set it depending on the value read.
test_and_set( &mutex_value, 0, 1 ); // if mutex_value is 0, set to 1
The trick is that both the read and write (also called test-and-set) operation should be atomic. The atomicity is achieved with CPU support.
However, the test-and-set operation doesn't offer any mechanism to block/wait.
CPU has no knowledge of threads blocking on a mutex. The OS takes the responsibility to manage the blocking by providing system calls to users. The implementation varies from OS to OS. In case of Linux, you can consider futex or pthreads as an example.
The overall costs of using a mutex sums up to the test-and-set operation and the system calls used to implement the mutex.
The test-and set operation is almost constant and is insignificant compared to the cost the other operation can amount to.
If there a multiple threads trying to acquire the lock, the cost of
mutex can be accredited to the following:
1. Kernel scheduling overhead cost
2. Context switch overhead cost
Kernel scheduling overhead
What happens to other threads, if one thread has already acquired lock on a mutex?
The other threads will continue. If any other thread(s) attempting to lock a mutex that is already locked, OS will (re)schedule the other thread(s) to wait. As soon as the original thread unlocks the mutex, kernel will wake up one of the threads waiting on the mutex.
Context switch overhead
User space code should be designed in such a manner that a thread should spend a very less time trying to lock on a mutex. If you have multiple thread trying to acquire lock on a mutex at multiple places, it may result in a disaster and the performance may be as poor as a single thread serving all requests.
Can the executing hardware thread be resheduled to do something
computationally more expensive?
If I am getting your question correctly, the thread which has acquired the lock can be context switched, depending on the scheduling mechanism. But, that is an overhead of multi-threaded programming, itself.
Can you provide a use case, to define this problem clearly?
who checks if the mutex gets unlocked?
Definitely the OS scheduler. Note, it is not just a blind sleep().
Threads are just a logical OS concept. There are no "hardware threads". Hardware has cores. The OS schedules a core to run a thread for a certain amount of time. If a thread gets blocked, there are always plenty left to run.
Taking your example, with mutexes, if thread 2 is blocked, the OS takes it off schedule and puts it in a queue associated with the mutex. When thread 1 releases the lock, it notifies the scheduler, which takes thread 2 off the queue and puts it back on the schedule. A blocked thread isn't using compute resources. However, there is overhead involved in the actual lock/unlock operation, which is an OS scheduling call.
That overhead is not insignificant, so you would generally use mutexes if you have longer tasks (reasonably longer than a scheduling time slice) and not too much lock competition.
So if a lock goes out of scope, there is some code in the destructor that tells the OS that the locked mutex is now unlocked?
Blockquote
If a std::mutex goes out of scope while locked, that is undefined behavior. (https://en.cppreference.com/w/cpp/thread/mutex/~mutex) Even with non-std mutex implementations, it's reasonable to expect one to unlock before going out of scope.
Keep in mind that there are other kinds of "lock" (like spinlock...which itself has many versions) but we're only talking about mutexes here.
I am just wondering if there is any locking policy in C++11 which would prevent threads from starvation.
I have a bunch of threads which are competing for one mutex. Now, my problem is that the thread which is leaving a critical section starts immediately compete for the same mutex and most of the time wins. Therefore other threads waiting on the mutex are starving.
I do not want to let the thread, leaving a critical section, sleep for some minimal amount of time to give other threads a chance to lock the mutex.
I thought that there must be some parameter which would enable fair locking for threads waiting on the mutex but I wasn't able to find any appropriate solution.
Well I found std::this_thread::yield() function, which suppose to reschedule the order of threads execution, but it is only hint to scheduler thread and depends on scheduler thread implementation if it reschedule the threads or not.
Is there any way how to provide fair locking policy for the threads waiting on the same mutex in C++11?
What are the usual strategies?
Thanks
This is a common optimization in mutexes designed to avoid wasting time switching tasks when the same thread can take the mutex again. If the current thread still has time left in its time slice then you get more throughput in terms of user-instructions-executed-per-second by letting it take the mutex rather than suspending it, and switching to another thread (which likely causes a big reload of cache lines and various other delays).
If you have so much contention on a mutex that this is a problem then your application design is wrong. You have all these threads blocked on a mutex, and therefore not doing anything: you are probably better off without so many threads.
You should design your application so that if multiple threads compete for a mutex then it doesn't matter which thread gets the lock. Direct contention should also be a rare thing, especially direct contention with lots of threads.
The only situation where I can think this is an OK scenario is where every thread is waiting on a condition variable, which is then broadcast to wake them all. Every thread will then contend for the mutex, but if you are doing this right then they should all do a quick check that this isn't a spurious wake and then release the mutex. Even then, this is called a "thundering herd" situation, and is not ideal, precisely because it serializes all these threads.
I'm using pthread on Linux. I have a circular buffer to pass data from one thread to another. Maybe the circular buffer is not the best structure to use here, but changing that would not make my problem go away, so we'll just refer it as a queue.
Whenever my queue is either full or empty, pop/push operations return NULL. This is problematic since my threads fire periodically. Waiting for another thread loop would take too long.
I've tried using semaphores (sem_post, sem_wait) but unlocking under contention takes up to 25 ms, which is about the speed of my loop. I've tried waiting with pthread_cond_t, but the unlocking takes up to between 10 and 15 ms.
Is there a faster mechanism I could use to wait for data?
EDIT*
Ok I used condition variables. I'm on an embedded device so adding "more cores or cpu power" is not an option. This made me realise I had all sorts of thread priorities set all over the place so I'll sort this out before going further
You should use condition variables. The only faster ways are platform-specific, and they're only negligibly faster.
You're seeing what you think is poor performance simply because your threads are being de-scheduled. You're seeing long "delays" when your thread is near the end of its timeslice and the scheduler allows the unblocked thread to pre-empt the running thread. If you have more cores than threads or set your thread to a higher priority, you won't see these delays.
But these delays are actually a good thing, and you shouldn't be concerned about them. Other threads just get a chance to run too.