Do deadlocks cause high CPU utilization? - c++

Do deadlocks put processes into a high rate of CPU usage, or do these two processes both "sleep", waiting on the other to finish?
I am trying to debug a multithreaded program written in C++ on a Linux system. I have noticed excessive CPU utilization from one particular process, and am wondering if it could be due to a deadlock issue. I have identified that one process consistently uses more of the CPU than I would anticipate (using top), and the process works, but it works slowly. If deadlocks cause the processes to sleep and do not cause high CPU usage, then at least I know this is not a deadlocking issue.

A deadlock typically does not cause high CPU usage, at least not if the deadlock occurs in synchronization primitives that are backed by the OS such that processes sleep while they wait.
If the deadlock involves, e.g., lockless synchronization mechanisms (such as compare-exchange in a spin loop), CPU usage will be high.
Also, there is the notion of a livelock, which occurs when a program with multiple threads is unable to advance to some intended state because some condition (that depends on interaction between threads) cannot be fulfilled, even though none of the threads is explicitly waiting for something.

It depends on the type of lock. A lock that is implemented as a spin loop could run up 100% CPU usage in a deadlock situation.
On the other hand, a signalling lock such as a kernel mutex does not consume CPU cycles while waiting, so a deadlock on such a lock would not peg the CPU at 100%.
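To check the difference on your own machine, here is a minimal sketch. It assumes Linux/POSIX (`clock_gettime` with `CLOCK_THREAD_CPUTIME_ID`), and the function names are made up for illustration. It does not create a real deadlock; it only keeps a waiter stuck for a fixed time, once on a `std::mutex` and once in a spin loop, and reports how much CPU time the waiter burned meanwhile.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <time.h>   // clock_gettime, CLOCK_THREAD_CPUTIME_ID (POSIX)

// CPU time consumed so far by the calling thread, in milliseconds.
static double thread_cpu_ms() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

// CPU time a waiter uses while blocked on a std::mutex that the calling
// thread holds for hold_ms milliseconds.
double cpu_ms_blocked_wait(int hold_ms) {
    std::mutex m;
    double used = 0.0;
    m.lock();                          // the "owner" takes the lock first
    std::thread waiter([&] {
        const double t0 = thread_cpu_ms();
        m.lock();                      // waits here until the owner unlocks
        used = thread_cpu_ms() - t0;
        m.unlock();
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(hold_ms));
    m.unlock();
    waiter.join();
    return used;
}

// Same experiment, but the waiter polls an atomic flag in a busy loop.
double cpu_ms_spin_wait(int hold_ms) {
    std::atomic<bool> locked{true};
    double used = 0.0;
    std::thread waiter([&] {
        const double t0 = thread_cpu_ms();
        while (locked.load(std::memory_order_acquire)) { /* busy-wait */ }
        used = thread_cpu_ms() - t0;
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(hold_ms));
    locked.store(false, std::memory_order_release);
    waiter.join();
    return used;
}
```

I would expect `cpu_ms_blocked_wait(200)` to come out close to zero and `cpu_ms_spin_wait(200)` to come out close to 200, but the exact numbers depend on the scheduler and on the load of the machine.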

Related

Busy/wait or spinning pattern in multithreaded environment

https://www.linkedin.com/pulse/how-do-i-design-high-frequency-trading-systems-its-part-silahian-2/
avoiding cache misses and CPU’s context switching
How does the busy-wait/spinning pattern avoid context switches if it runs two threads on one core?
There will still be context switches between these two threads (one producer thread and one consumer thread), right?
What are the consequences if I don't set thread affinity to one specific core?
I completely get why it avoids cache misses, but I am still having trouble seeing how it avoids context switches.
How does the busy-wait/spinning pattern avoid context switches if it runs two threads on one core?
When a thread tries to take a lock that is held by another thread, it makes a system call so the OS knows that the thread is waiting for the lock and that there is no point in letting it continue executing. This costs a system call and a context switch (because the OS will try to run another thread on that processing unit), both of which are expensive.
A spin lock in effect lies to the OS by not telling it that the thread is waiting, even though it is. As a result the OS waits until the end of the quantum (the time slice allocated to the thread) before doing a context switch. A quantum is generally fairly long (e.g. 8 ms), so the overhead of context switches does not look like a problem in this case. However, it is a problem if you care about latency. If another thread on the same core also uses a spin lock, execution becomes very inefficient: each thread runs for its full quantum, and the average latency is about half a quantum, which is generally far more than the overhead of a context switch. To avoid this, you should make sure there are more cores than actively running threads, and control the environment. Otherwise, spin locks are actually more harmful than context switches. If you do not control the environment, spin locks are generally not a good idea.
If two threads run on the same core and 2-way SMT is enabled, each thread will likely run on its own hardware thread. In this case a spinning thread can significantly slow down the other thread while doing no useful work. The x86-64 pause instruction can be used to tell the processor that the thread is in a spin loop, letting the other hardware thread of the same core make progress. This instruction also helps by reducing contention. Note that even with the pause instruction, modern processors will run at full speed (possibly at turbo frequency), consuming a lot of energy (and producing heat).
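As an illustration of the pause hint, here is a minimal sketch of a test-and-test-and-set spin lock. `SpinLock`, `cpu_relax` and `count_with_spinlock` are hypothetical names; on x86 the sketch uses the `_mm_pause()` intrinsic, and on other architectures it falls back to `std::this_thread::yield()` as a stand-in, which is an assumption of this sketch and not an equivalent instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
static inline void cpu_relax() { _mm_pause(); }  // x86 "pause" spin-loop hint
#else
static inline void cpu_relax() { std::this_thread::yield(); }  // fallback
#endif

class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // Try to grab the lock; exchange returns the previous value.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Lock is taken: spin on a plain load and hint the core each time.
            while (locked_.load(std::memory_order_relaxed)) cpu_relax();
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Several threads increment a shared counter under the spin lock.
long count_with_spinlock(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The waiter never tells the OS it is waiting, so this only behaves well when each thread has a core (or hardware thread) to itself, as described above.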
What are the consequences if I don't set thread affinity to one specific core?
Threads can then move between cores. If a thread moves from one core to another, it generally needs to reload data into its cache (typically filling the L2 from the L3 or from another L2 cache) through cache misses that may occur on the critical path of a latency-critical operation. In the worst cases, a thread can move from one NUMA node to another. Data transfers between NUMA nodes are generally slower. Moreover, the core will then access data owned by the memory of another NUMA node, which is more expensive. Besides, migration can increase the overall cost of context switches.
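For reference, pinning a thread on Linux can be sketched as below. This assumes glibc's non-portable `pthread_setaffinity_np` and `sched_getaffinity` (g++ defines `_GNU_SOURCE` by default, which these need); the helper names are made up for illustration.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// First CPU the calling thread is currently allowed to run on, or -1 on error.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Restrict the calling thread to a single CPU. Returns true on success.
bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Picking the CPU from the current affinity mask rather than hard-coding CPU 0 matters in containers, where the process may not be allowed to run on every core.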

What will the CPU do when a thread is waiting for a mutex

I'm curious about the behavior of the CPU while a thread is waiting for a mutex. I can imagine two possibilities:
The CPU stays on the current thread and continually checks whether the mutex has been unlocked.
The CPU switches to another thread (or process) for a moment, then switches back to the original thread and checks again from time to time.
Which one is right, or does the STL implement it another way?
To understand this you first need to understand the difference between a thread and a CPU core. A thread is an abstract thing, a data structure, that is used to represent some sequence of operations to be executed. The OS assigns threads to CPU cores, and those cores then execute those operations. The OS (and also hardware) can interrupt this execution at any time (although not in the middle of a single instruction), save the thread's state, suspend it, and assign some other thread to that core. This is known as a context switch. The OS sometimes does that on so-called syscalls as well (when a program calls some OS functionality, e.g. asks for access to disk, network, etc.). This is important because mutexes use syscalls under the hood.
So what happens when a thread tries to acquire a locked mutex? First of all, no periodic checks happen. While possible, that would be a waste of CPU cycles, and it is extremely unlikely that any serious OS does that. What actually happens is that each mutex internally has an associated queue. When the mutex is already locked, the OS adds the current thread to this queue and suspends it. Afterwards the OS assigns some other thread to this CPU core, if one is available.
Now if a mutex is locked, then there's a thread that actually locked it. Let's call that thread the owner. This thread is not suspended, and it does some work. When it finishes whatever it is doing, it has to unlock the mutex (which generally involves a syscall as well when other threads are waiting), otherwise those pending threads will never resume. When that (i.e. the unlocking) happens, the OS looks at the associated queue and picks a thread from it (which one is an implementation detail; it will often be some kind of priority queue). This newly picked thread becomes the new owner of the mutex, and the OS resumes it, meaning it schedules the thread for execution. Schedule, because all cores may be busy at the moment.
Note that this is a brief overview of the topic. There are lots of other things and optimizations in play, like futexes, and how to actually implement thread-safe (or rather core-safe) code without mutexes (mutexes are not a hardware feature; they are implemented in the OS). But that's more or less how things are.
Typically the thread will attempt to acquire the mutex, and if it can't (e.g. because another thread already acquired it) it will inform the scheduler and the scheduler will block the waiting thread and switch to a different thread, and then (later, when the lock is released) the scheduler will unblock the waiting thread and give it CPU time again.
On single-CPU systems, this is almost required. All CPU time spent (e.g. "spinning"/polling the lock again) between finding out the lock can't be acquired and doing a task switch (to a thread that may release the lock) is a waste of CPU time that will achieve nothing (because no other thread can release the lock until a task switch occurs).
However, research on multi-CPU systems (that I vaguely remember from about 20 years ago that may or may not have been done by Sun for Solaris) indicates that a small amount of "spinning" (in the hope that a thread running on a different CPU releases the lock in time) can be beneficial (by avoiding the cost of task switch/es). My intuition is that "time spent spinning before blocking" should be roughly equal to the cost of a task switch (or, if a task switch costs 123 microseconds, it'd probably be worthwhile spinning for 123 microseconds before the scheduler is told to block your thread); but this would depend heavily on scenario (e.g. how heavily contended the lock is, etc).
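The "spin briefly, then block" idea can be sketched in user space as follows. This is only an illustration: `lock_spin_then_block` and `demo_contended` are hypothetical names, the spin budget is a parameter you would have to tune, and a real implementation would do this inside the mutex itself and not on top of `try_lock()`.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Poll the mutex with try_lock() for at most spin_budget, then fall back to
// a blocking lock(). Returns true if the lock was obtained while spinning.
// Either way the caller owns the mutex on return.
template <class Mutex>
bool lock_spin_then_block(Mutex& m, std::chrono::microseconds spin_budget) {
    const auto deadline = std::chrono::steady_clock::now() + spin_budget;
    do {
        if (m.try_lock()) return true;            // acquired without blocking
    } while (std::chrono::steady_clock::now() < deadline);
    m.lock();                                     // budget used up: block
    return false;
}

// Demo: another thread holds the mutex for hold_ms while the caller tries to
// take it with the given spin budget. Returns true if spinning was enough.
bool demo_contended(int hold_ms, std::chrono::microseconds spin_budget) {
    std::mutex m;
    std::atomic<bool> held{false};
    std::thread owner([&] {
        std::lock_guard<std::mutex> guard(m);
        held.store(true);
        std::this_thread::sleep_for(std::chrono::milliseconds(hold_ms));
    });
    while (!held.load()) std::this_thread::yield();  // wait until owner has it
    const bool spun = lock_spin_then_block(m, spin_budget);
    m.unlock();
    owner.join();
    return spun;
}
```

With a short hold time and a generous budget the waiter usually gets the lock while spinning; with a long hold time and a small budget it ends up blocking, which is the case where spinning would only have wasted CPU time.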
Typically, the hardware thread (your "CPU") will be switched to running a different software thread by the kernel, and the original software thread will be set aside until the mutex it is waiting on becomes signaled. At that point the kernel will place it among the set of software threads that it seeks to schedule for execution on one of the hardware threads in the system.
Your option 1 describes a spinlock; see pthread_spin_lock(). A critical section on Microsoft's platforms can also behave this way for a short while, since it may spin briefly before it blocks.
Your option 2 is most similar to what usually happens.
In the Microsoft world, the Mutex is waited on with WaitForSingleObject(), which is described as
If the object's state is nonsignaled, the calling thread enters the wait state until the object is signaled or the time-out interval elapses.
Now you need to know that the "wait state" is a state where the thread is not active. We call it "blocking", which is the opposite of a busy wait where CPU time is used.
At the beginning of the wait, the kernel will immediately give the CPU to another thread and will not give it back to your thread until the Mutex becomes "signaled". So the wait itself uses essentially 0 CPU cycles.
When the kernel notices that the Mutex has changed, it can "wake up" the thread and might even boost its priority because it has been waiting politely all that time.
The CPU stays on the current thread and continually checks whether the mutex has been unlocked.
It's not the CPU that picks a thread to be executed. The thread scheduler of Windows picks the thread that gets executed.
If a Mutex could block a CPU that way, you would need only 8 or 12 Mutexes (one per core) to completely hang your system.
The CPU switches to another thread (or process) for a moment [...]
Almost. There will be a timer interrupt. The interrupt is handled by an interrupt service routine in the Windows kernel. At that time, the kernel can decide which thread will be executed next.
[...] then switches back to the original thread and checks again from time to time.
No. Because the Mutex is a kernel object, the kernel already knows that there's no use in letting the thread check again until the Mutex has been signaled.

Concurrency Fairness & Deadlock

Can somebody give appropriate definitions for the terms Fairness and Deadlock? I am told that these terms are used for concurrent processes.
In a nutshell, concurrent processes share the CPU, and the operating system schedules CPU bursts for each process to run. Fairness is one of the things that needs to be considered in order to guarantee progress and prevent starvation.
Deadlock is a situation in which there is a circular dependency, where each process waits for another process to progress. You will also need to read about mutexes and critical sections.
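To make the circular dependency concrete: if one thread locks A then B while another locks B then A, each can end up holding one lock and waiting forever for the other. A minimal sketch of one common way around this uses `std::lock`, which acquires several mutexes without deadlocking. `Account`, `transfer` and `run_opposing_transfers` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

struct Account {
    std::mutex m;
    long balance = 0;
};

// Move money between two accounts. std::lock takes both mutexes with a
// deadlock-avoidance algorithm, so the order of the arguments does not matter.
void transfer(Account& from, Account& to, long amount) {
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);  // release on exit
    std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions, which is the pattern that can
// deadlock when each thread simply locks "from" first and "to" second.
long run_opposing_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;  // total money is conserved
}
```

Another option is to always take the locks in one fixed global order (for example by address), which breaks the circle in the same way.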

Minimize Context Switching Time between processes

I have 4 processes sharing a common semaphore; all the processes have the same priority.
The critical region inside the lock performs read/write operations, including an fflush() call.
In the logs, I observed that after one process releases the semaphore, the other processes take a considerable amount of time to acquire the lock.
Since all 4 processes block at the same point, this causes a performance issue on an embedded device. If the lock were shared between threads, pthread_cond_t could be used to minimize the switching time. What can be done to minimize the switching time between processes?
Context switches between processes are handled inside the kernel. Context switching is the job of the kernel scheduler, so you can't do much here other than trying to speed up the scheduler's context-switch path. An alternative is to work out where the time actually goes and improve your app by reducing lock contention.

Relationship between shared memory concurrency algorithms and mutexes/semaphores

I am trying to figure out the relationship between shared memory based concurrency algorithms (Peterson's / Bakery) and the use of semaphores and mutexes.
In the first case, we have a system without OS intervention, and processes can synchronize themselves using shared memory and busy waiting.
In the second case, the OS provides processes/threads with the ability to block, and not have to busy wait.
Is there ever a situation where we'd like to use shared memory in addition to semaphores (to ensure fairness / lack of starvation), or does the OS offer a better way to do this?
(I am wondering about the general concepts, but answers specific to POSIX/Win32/JAVA threads are also interesting).
Thanks a lot!
I can't think of any circumstances where what you actually want is a busy wait. Busy waiting just consumes processor time without achieving anything. That's not to say that "busy wait" algorithms aren't useful (they are), but the "busy wait" part is not the desired property, it is just a necessary consequence of a property that is desired.
Peterson's lock algorithm and Lamport's bakery algorithm are fundamentally just implementations of the mutex concept. OS facilities provide implementations of the same concept, but with different trade-offs.
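For comparison with an OS mutex, here is a minimal sketch of Peterson's algorithm for exactly two threads, written with sequentially consistent `std::atomic` operations (plain variables would not be safe on modern compilers and CPUs). `PetersonLock` and `count_with_peterson` are names made up for this example, and the `yield()` in the wait loop is an addition of this sketch so that it behaves tolerably when both threads share a core.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

class PetersonLock {
    std::atomic<bool> wants_[2];  // wants_[i]: thread i wants to enter
    std::atomic<int> turn_;       // which thread has to wait on a tie
public:
    PetersonLock() : turn_(0) {
        wants_[0].store(false);
        wants_[1].store(false);
    }
    // self must be 0 or 1 and must identify the calling thread.
    void lock(int self) {
        const int other = 1 - self;
        wants_[self].store(true);   // announce interest
        turn_.store(other);         // let the other thread go first on a tie
        while (wants_[other].load() && turn_.load() == other) {
            std::this_thread::yield();  // busy-wait, but give up the core
        }
    }
    void unlock(int self) { wants_[self].store(false); }
};

// Two threads increment a shared counter, protected only by the lock above.
long count_with_peterson(int iters) {
    PetersonLock lock;
    long counter = 0;
    auto worker = [&](int id) {
        for (int i = 0; i < iters; ++i) {
            lock.lock(id);
            ++counter;
            lock.unlock(id);
        }
    };
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return counter;
}
```

The waiting thread here is never put to sleep by the OS on behalf of the lock; it keeps polling shared memory, which is the trade-off discussed in the following paragraphs.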
The "ideal" implementation of a mutex would have "zero overhead" --- acquiring a lock on a mutex would not take any time at all if it was not currently owned, a waiting thread would wake the instant that the prior owner released the lock, and in the mean time, the waiting thread would not consume any processor time.
A "busy wait" or "spin lock" algorithm trades processor time used by the waiting thread for a reduced wake-up time. Provided the thread is currently scheduled on a processor, a busy-waiter will wake as fast as the processor can transfer the necessary data for acquiring the lock and synchronizing the threads, but whilst it is waiting it will consume its maximum allotment of processor time. If the number of threads exceeds the number of available processors, this may well take time from the thread that currently owns the mutex, thus making the wait longer. However, in some cases the low latency between unlocking and locking is worth the trade-off.
On the other hand, a "blocking" mutex that uses OS facilities to put a waiting thread to sleep has a different trade-off. In this case, the time between unlocking a mutex and a waiting thread acquiring it can be quite large, possibly several hundred times larger than with a busy-wait algorithm. The benefit is that the waiting thread really does consume no processor time whilst waiting, so the OS can schedule other work whilst the thread is waiting. This can thus potentially reduce the overall wait time, and increase the overall throughput of the system.
Some mutex implementations use a combination of busy-waiting and blocking: they busy-wait for a short time, and then switch to blocking if the lock cannot be acquired in the short time. This has the benefits of the fast wake if the lock is released shortly after the thread began waiting, whilst consuming no processor time if the thread has to wait a long time. It also has the downsides of high processor usage for short waits, and slow wake-ups for long waits.