Conditional wait overhead - c++

When using boost::conditional_variable, ACE_Conditional or directly pthread_cond_wait, is there any overhead for the waiting itself? These are more specific issues that trouble be:
After the waiting thread is unscheduled, will it be scheduled back before the wait expires and then unscheduled again or it will stay unscheduled until signaled?
Does wait acquires periodically the mutex? In this case, I guess it wastes each iteration some CPU time on system calls to lock and release the mutex. Is it the same as constantly acquiring and releasing a mutex?
Also, then, how much time passes between the signal and the return from wait?
Afaik, when using semaphores the acquire calls responsiveness is dependent on scheduler time slice size. How does it work in pthread_cond_wait? I assume this is platform dependent. I am more interested in Linux but if someone knows how it works on other platforms, it will help too.
And one more question: are there any additional system resources allocated for each conditional? I won't create 30000 mutexes in my code, but should I worry about 30000 conditionals that use the same one mutex?

Here's what is written in the pthread_cond man page:
pthread_cond_wait atomically unlocks the mutex and waits for the condition variable cond to be signaled. The thread execution is suspended and does not consume any CPU time until the condition variable is signaled.
So from here I'd answer to the questions as following:
The waiting thread won't be scheduled back before the wait was signaled or canceled.
There are no periodic mutex acquisitions. The mutex is reacquired only once before wait returns.
The time that passes between the signal and the wait return is similar to that of thread scheduling due to mutex release.
Regarding the resources, on the same man page:
In the LinuxThreads implementation, no resources are associated with condition variables, thus pthread_cond_destroy actually does nothing except checking that the condition has no waiting threads.
Update: I dug into the sources of pthread_cond_* functions and the behavior is as follows:
All the pthread conditionals in Linux are implemented using futex.
When a thread calls wait it is suspended and unscheduled. The thread id is inserted at the tail of a list of waiting threads.
When a thread calls signal the thread at the head of the list is scheduled back.
So, the waking is as efficient as the scheduler, no OS resources are consumed and the only memory overhead is the size of the waiting list (see futex_wake function).

You should only call pthread_cond_wait if the variable is already in the "wrong" state. Since it always waits, there is always the overhead associated with putting the current thread to sleep and switching.
When the thread is unscheduled, it is unscheduled. It should not use any resources, but of course an OS can in theory be implemented badly. It is allowed to re-acquire the mutex, and even to return, before the signal (which is why you must double-check the condition), but the OS will be implemented so this doesn't impact performance much, if it happens at all. It doesn't happen spontaneously, but rather in response to another, possibly-unrelated signal.
30000 mutexes shouldn't be a problem, but some OSes might have a problem with 30000 sleeping threads.

Related

What will cpu do when a thread waiting for a mutex

I'm curious about the behavior of cpu during a thread waiting for a mutex. Now I can imagine two possibilities:
The cpu stay on the current thread and check if the mutex had been unlocked continually.
The cpu will switch to another thread(or process) for a moment and switch back to the origin thread and check temporary.
Which one is right or the stl implement in another way?
To understand this you first need to understand the difference between thread and cpu core. Thread is an abstract thing, a data structure, that is used to represent some sequence of operations to be executed. The OS assigns threads to cpu cores, and those cores then execute those operations. The OS (and also hardware) can also interrupt this execution at any time (although not in the middle of a single instruction), save such thread's state, suspend it, and assign some other thread to that core. This is also known as context switch. The OS sometimes does that on so called syscalls (when a program calls some OS's functionality, e.g. asks for the access to disk, network, etc.) as well. It is important because mutexes utilize some syscalls under the hood.
So what happens when a thread tries to access a locked mutex? First of all, no periodical checks happen. While possible, that would be a waste of cpu cycles and extremely unlikely that any serious OS does that. What actually happens is that each mutex internally has a queue associated. When it is locked, the OS will add current thread to this queue and will suspend it. Afterwards the OS will assign some other thread to this cpu core, if available.
Now if a mutex is locked, then there's a thread that actually locked that mutex. Let's call that thread an owner. This thread is not suspended, and it does some work. When it finishes whatever it is doing, it has to unlock the mutex (which is a syscall as well), otherwise those pending threads will never resume. When that (i.e. the unlocking) happens the OS will look at the associated queue, and pick a thread from it (which one is an implementation detail, it will often be some priority queue). This newly picked thread will be the new owner of the mutex, and the OS will resume it, meaning schedule the thread for execution. Schedule, because all cores may be busy at the moment.
Note that this is a brief overview of the topic. There are lots of other things and optimizations in play, like futexes and how to actually implement thread-safe (or rather core-safe) code without mutexes (these are not hardware features, mutexes are implemented in the OS). But that's more or less how things are.
Typically the thread will attempt to acquire the mutex, and if it can't (e.g. because another thread already acquired it) it will inform the scheduler and the scheduler will block the waiting thread and switch to a different thread, and then (later, when the lock is released) the scheduler will unblock the waiting thread and give it CPU time again.
On single-CPU systems; this is almost required. All CPU time spent (e.g. "spinning"/polling the lock again) between finding out the lock can't be acquired and doing a task switch (to a thread that may release the lock) is a waste of CPU time that will achieve nothing (because no other thread can release the lock until a task switch occurs).
However, research on multi-CPU systems (that I vaguely remember from about 20 years ago that may or may not have been done by Sun for Solaris) indicates that a small amount of "spinning" (in the hope that a thread running on a different CPU releases the lock in time) can be beneficial (by avoiding the cost of task switch/es). My intuition is that "time spent spinning before blocking" should be roughly equal to the cost of a task switch (or, if a task switch costs 123 microseconds, it'd probably be worthwhile spinning for 123 microseconds before the scheduler is told to block your thread); but this would depend heavily on scenario (e.g. how heavily contended the lock is, etc).
Typically,
The hardware thread (your "CPU") will be switched to running a different software thread by the kernel, and the original software thread will be set aside until the mutex it is waiting on becomes signaled. At that point the kernel will place it among the set of software threads that it seeks to schedule for execution on one of the hardware threads in the system.
Your option 1 applies to what is called a critical section on Microsoft's platforms and more generally a spinlock. See pthread_spin_lock().
Your option 2 is most similar to what usually happens.
In the Microsoft world, the Mutex is waited on with WaitForSingleObject(), which is described as
If the object's state is nonsignaled, the calling thread enters the wait state until the object is signaled or the time-out interval elapses.
Now you need to know that the "wait state" is a state where the thread is not active. We call it "blocking", which is the opposite of a busy wait where CPU time is used.
At that beginning, the kernel will immediately give the CPU to another thread and never give it back to your thread, unless the Mutex is becoming "signaled". So it will really use 0 CPU cycles during the wait.
When the kernel notices that the Mutex has changed, it can "wake up" the thread and might even boost its priority because it was waiting friendly all the time.
The cpu stay on the current thread and check if the mutex had been unlocked continually.
It's not the CPU that picks a thread to be executed. The thread scheduler of Windows will pick a thread that gets executed.
If a Mutex could block a CPU that way, you need to only 8 or 12 Mutexes to fully brick your system.
The cpu will switch to another thread(or process) for a moment [...]
Almost. There will be an interrupt by a timer. The interrupt will be handled by an interrupt service routine by the Windows kernel. At that time, the kernel can decide which thread will be executed next.
[...] and switch back to the origin thread and check temporary.
No. Because the Mutex is a kernel object, the kernel already knows that there's no used in letting the thread check again unless the Mutex has been signaled.

How expensive is a blocked mutex?

Say I have a mutex and thread 1 locked the mutex. Now, thread 2 tries to acquire the lock but it is blocked, say for a couple of seconds. How expensive is this blocked thread? Can the executing hardware thread be rescheduled to do something computationally more expensive? If yes, then who checks if the mutex gets unlocked?
EDIT: Ok so I try to reformulate what I wanted to ask.
What I dont really understand is how the following works. thread 2 got blocked, so what exactly does thread 2 do? From the answer it seems like it is not just constantly checking if the mutex gets unlocked. If this were the case, I would consider a blocked thread expensive, as I am using one of my hardware threads just for checking if some boolean value changes.
So am I correct in thinking that when the mutex gets released by thread 1, thread 1 notifies the sheduler and the shedular assigns a hardware thread to execute thread 2 which is waiting?
I am reading your questions as:
How expensive is a locked mutex?
Mutex can be considered as an integer in memory.
A thread trying to lock on a mutex has to read the existing state of the mutex and can set it depending on the value read.
test_and_set( &mutex_value, 0, 1 ); // if mutex_value is 0, set to 1
The trick is that both the read and write (also called test-and-set) operation should be atomic. The atomicity is achieved with CPU support.
However, the test-and-set operation doesn't offer any mechanism to block/wait.
CPU has no knowledge of threads blocking on a mutex. The OS takes the responsibility to manage the blocking by providing system calls to users. The implementation varies from OS to OS. In case of Linux, you can consider futex or pthreads as an example.
The overall costs of using a mutex sums up to the test-and-set operation and the system calls used to implement the mutex.
The test-and set operation is almost constant and is insignificant compared to the cost the other operation can amount to.
If there a multiple threads trying to acquire the lock, the cost of
mutex can be accredited to the following:
1. Kernel scheduling overhead cost
2. Context switch overhead cost
Kernel scheduling overhead
What happens to other threads, if one thread has already acquired lock on a mutex?
The other threads will continue. If any other thread(s) attempting to lock a mutex that is already locked, OS will (re)schedule the other thread(s) to wait. As soon as the original thread unlocks the mutex, kernel will wake up one of the threads waiting on the mutex.
Context switch overhead
User space code should be designed in such a manner that a thread should spend a very less time trying to lock on a mutex. If you have multiple thread trying to acquire lock on a mutex at multiple places, it may result in a disaster and the performance may be as poor as a single thread serving all requests.
Can the executing hardware thread be resheduled to do something
computationally more expensive?
If I am getting your question correctly, the thread which has acquired the lock can be context switched, depending on the scheduling mechanism. But, that is an overhead of multi-threaded programming, itself.
Can you provide a use case, to define this problem clearly?
who checks if the mutex gets unlocked?
Definitely the OS scheduler. Note, it is not just a blind sleep().
Threads are just a logical OS concept. There are no "hardware threads". Hardware has cores. The OS schedules a core to run a thread for a certain amount of time. If a thread gets blocked, there are always plenty left to run.
Taking your example, with mutexes, if thread 2 is blocked, the OS takes it off schedule and puts it in a queue associated with the mutex. When thread 1 releases the lock, it notifies the scheduler, which takes thread 2 off the queue and puts it back on the schedule. A blocked thread isn't using compute resources. However, there is overhead involved in the actual lock/unlock operation, which is an OS scheduling call.
That overhead is not insignificant, so you would generally use mutexes if you have longer tasks (reasonably longer than a scheduling time slice) and not too much lock competition.
So if a lock goes out of scope, there is some code in the destructor that tells the OS that the locked mutex is now unlocked?
Blockquote
If a std::mutex goes out of scope while locked, that is undefined behavior. (https://en.cppreference.com/w/cpp/thread/mutex/~mutex) Even with non-std mutex implementations, it's reasonable to expect one to unlock before going out of scope.
Keep in mind that there are other kinds of "lock" (like spinlock...which itself has many versions) but we're only talking about mutexes here.

std::mutex::lock blocking CPU usage

I want to be able to freeze and unfreeze a thread at will.
My current solution is done through callbacks and busy waiting with sleep.
This is obviously not an optimal solution.
I'm considering having the main thread lock a mutex, then have the slave thread run a function that locks and unlocks the same mutex.
My worry is the possible CPU usage if it's a true busy wait.
My question, as such, is: how does STL in C++11 specify "blocking", and if it is a busy wait, are there less CPU intensive solutions (e.g. pthreads)?
While mutexes can be used, it is not an optimal solution because mutexes should be used for resource protection (see also this q/a).
Usually, std::condition_variable instances are what you should be looking for.
It works as follows:
Create an instance of std::condition_variable and distribute it to your controlling thread and your controlled thread
In the controlled thread, create a std::unique_lock. Pass it to one of the condition variable's wait methods
In the controlling thread, invoke one of the notify methods on the condition variable.
Hope that helps.
Have a look at this answer: Multithreading, when to yield versus sleep. Locking a mutex (in the manner you've described), is a reasonable solution to your problem.
Here's an MSDN article that's worth a read. Quote:
Until threads that are suspended or blocked become ready to run, the
scheduler does not allocate any processor time to them, regardless of
their priority.
If a thread isn't being scheduled it's not being run.

sleeping a thread in the middle of execution

What happens when a thread is put to sleep by other thread, possible by main thread, in the middle of its execution?
assuming I've a function Producer. What if Consumer sleep()s the Producer in the middle of production of one unit ?
Suppose the unit is half produced. and then its put on sleep(). The integrity of system may be in a problem
The thread that sleep is invoked on is put in the idle queue by the thread scheduler and is context switched out of the CPU it is running on, so other threads can take it's place.
All context (registers, stack pointer, base pointer, etc) are saved on the thread stack, so when it's run next time, it can continue from where it left off.
The OS is constantly doing context switches between threads in order to make your system seem like it's doing multiple things. The OS thread scheduler algorithm takes care of that.
Thread scheduling and threading is a big subject, if you want to really understand it, I suggest you start reading up on it. :)
EDIT: Using sleep for thread synchronization purposes not advised, you should use proper synchronization mechanisms to tell the thread to wait for other threads, etc.
There is no problem associated with this, unless some state is mutated while the thread sleeps, so it wakes up with a different set of values than before going to sleep.
Threads are switched in and out of execution by the CPU all the time, but that does not affect the overall outcome of their execution, assuming no data races or other bugs are present.
It would be unadvisable for one thread to forcibly and synchronously interfere with the execution of another thread. One thread could send an asynchronous message to another requesting that it reschedule itself in some way, but that would be handled by the other thread when it was in a suitable state to do so.
Assuming they communicate using channels that are thread-safe, nothing bad shoudl happen, as the sleeping thread will wake up eventually and grab data from its task queue or see that some semaphore has been set and read the prodced data.
If the threads communicate using nonvolatile variables or direct function calls that change state, that's when Bad Things occur.
I don't know of a way for a thread to forcibly cause another thread to sleep. If two threads are accessing a shared resource (like an input/output queue, which seems likely for you Produce/Consumer example), then both threads may contend for the same lock. The losing thread must wait for the other thread to release the lock if the contention is not the "trylock" variety. The thread that waits is placed into a waiting queue associated with the lock, and is removed from the schedulers run queue. When the winning thread releases the lock, the code checks the queue to see if there are threads still waiting to acquire it. If there are, one is chosen as the winner and is given the lock, and placed in the scheduler run queue.

What happens when pthreads wait in mutex_lock/cond_wait?

I have a program that should get the maximum out of my cpu.
It is multithreaded via pthreads that do their job well apart from the fact that they "only" get my cores to about 60% load which is not enough in my opinion.
I am searching for the reason and am asking myself (and hereby you) if the blocking functions mutex_lock/cond_wait are candidates?
What happens when a thread cannot run on in such a function?
Does pthread switch to another thread it handles or
does the thread yield its time to the system and if the latter is the case, can I change this behavior?
Regards,
Nobody
More Information
The setting is one mainthread that fills the taskpool and countless workers that fetch jobs from there and wait on a conditional that is signaled via broadcast when a serialized calculation is done. They go on with the values from this calculation until they are done, deliver their mail and fetch the next job...
On a typical modern pthreads implementation, each thread is managed by the kernel not unlike a separate process. Any blocking call like pthread_mutex_lock or pthread_cond_wait (but also, say, read) will yield its time to the system. The system will then find another eligible thread to schedule, whether in your process or another process, and run it.
If your program is only taking 60% of the CPU, it is more likely blocked on I/O than on pthread operations, unless you have done something way too granular with your pthread operations.
If a thread is waiting on a mutex/condition, it doesn't use resources (well, uses just a tiny amount). Whenever the thread enters waiting state, control switches to other threads. When the mutex is released (or condition variable signalled), the thread wakes up and may acquire the mutex (if no other thread grabs it first), and continue to run. If however some other thread acquires the mutex (this can happen if several threads are waiting for it), the thread returns to sleeping state.