I have four processes sharing a common semaphore, and all of them have the same priority.
The critical region inside the lock contains read/write operations, including an
fflush() call.
In the logs, I observed that after a particular process releases the semaphore,
it takes a considerable amount of time for another process to acquire the lock.
Since all four processes block at the same point, this is a performance problem on an embedded device. If the lock were shared between threads, pthread_cond_t could be used to minimize the switching time. What can be done to minimize the switching time between processes?
Context switching between processes happens inside the kernel. It's the job of the kernel scheduler to do the context switching, so you can't do much here other than trying to speed up the scheduler's context-switching path. Another alternative is to figure out where the time actually goes and improve your app by reducing lock contention (perhaps).
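If you cannot reduce the contention itself, one option on POSIX systems is to replace the semaphore with a process-shared pthread mutex placed in shared memory; on Linux these are futex-based, so an uncontended acquire never enters the kernel. A minimal sketch, assuming a hypothetical shm name "/my_lock" and that exactly one process performs the initialization:

```c
/* Sketch: a pthread mutex shared between processes via POSIX shared memory.
 * The name "/my_lock" is illustrative; exactly one process should call
 * pthread_mutex_init(), the others just shm_open() + mmap(). */
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

pthread_mutex_t *create_shared_mutex(void) {
    int fd = shm_open("/my_lock", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(pthread_mutex_t));
    pthread_mutex_t *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* This is what makes the mutex usable across process boundaries. */
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return m;
}
```

A process-shared pthread_cond_t can be set up the same way via pthread_condattr_setpshared(), which gives you the condition-variable pattern across processes rather than just threads.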
Related
https://www.linkedin.com/pulse/how-do-i-design-high-frequency-trading-systems-its-part-silahian-2/
Avoiding cache misses and CPU context switching.
How does the busy-wait/spinning pattern avoid context switches if it runs two threads on one core?
It will still have context switches between these two threads (the producer thread and one consumer thread), right?
What are the consequences if I don't set thread affinity to one specific core?
I completely get why it avoids cache misses, but I'm still having trouble seeing how it avoids context switches.
How does the busy-wait/spinning pattern avoid context switches if it runs two threads on one core?
When a thread tries to take a lock held by another thread, it makes a system call so the OS knows the thread is waiting for the lock and that letting it continue executing would be pointless. This means a system call plus a context switch (because the OS will try to execute another thread on the processing unit), both of which are expensive.
Using a spin lock essentially lies to the OS: the thread is waiting, but it never says so. As a result, the OS will wait until the end of the quantum (the time slice allocated to the thread) before doing a context switch. A quantum is generally pretty big (e.g. 8 ms), so the overhead of context switches does not seem to be a problem in this case. However, it is a problem if you care about latency. Indeed, if another thread also uses a spin lock, execution becomes very inefficient: each thread will run for its full quantum, and the average latency will be half a quantum, which is generally far more than the overhead of a context switch. To avoid this, you should make sure there are more cores than actively running threads, and you should control the environment. Otherwise, spin locks are actually more harmful than context switches. If you do not control the environment, spin locks are generally not a good idea.
If two threads run on the same core and 2-way SMT is enabled, each thread will likely execute on one of the two hardware threads. In that case a spin lock can significantly slow down the other thread while accomplishing nothing. The x86-64 pause instruction can be used to tell the processor that the thread is spinning and to let the other thread of the same core execute; it also helps reduce contention. Note that even with the pause instruction, a modern processor will run the spinning core at full speed (possibly at turbo frequency), consuming a lot of energy (and producing heat).
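For illustration, a minimal test-and-set spinlock that issues the pause hint might look like this (x86-specific sketch; _mm_pause is the compiler intrinsic for the pause instruction):

```c
/* Sketch: test-and-set spinlock with the x86 pause hint.
 * Initialize with: spinlock_t s = { ATOMIC_FLAG_INIT }; */
#include <immintrin.h>   /* _mm_pause */
#include <stdatomic.h>

typedef struct { atomic_flag f; } spinlock_t;

void spin_lock(spinlock_t *s) {
    while (atomic_flag_test_and_set_explicit(&s->f, memory_order_acquire))
        _mm_pause();  /* hint: we are spinning; let the sibling SMT thread run */
}

void spin_unlock(spinlock_t *s) {
    atomic_flag_clear_explicit(&s->f, memory_order_release);
}
```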
What are the consequences if I don't set thread affinity to one specific core?
Threads can then move between cores. If a thread moves from one core to another, it generally needs to reload data into its cache (typically filling the L2 from the L3 or from another L2), and those cache misses may land on the critical path of a latency-critical operation. In the worst cases, a thread can move from one NUMA node to another. Data transfers between NUMA nodes are generally slower, and the core will then be accessing data owned by another NUMA node's memory, which is more expensive. It can also increase the overall cost of context switches.
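If you do decide to pin threads, Linux exposes this through the non-portable pthread_setaffinity_np(). A minimal sketch; the core number passed in is up to you:

```c
/* Sketch: pin the calling thread to a single core (Linux-specific API). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);  /* allow this thread to run only on `core` */
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```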
I'm curious about the behavior of the CPU while a thread is waiting on a mutex. I can imagine two possibilities:
The CPU stays on the current thread and continually checks whether the mutex has been unlocked.
The CPU switches to another thread (or process) for a moment, then switches back to the original thread and checks periodically.
Which one is right, or does the STL implement it in some other way?
To understand this you first need to understand the difference between a thread and a CPU core. A thread is an abstract thing, a data structure, used to represent a sequence of operations to be executed. The OS assigns threads to CPU cores, and those cores then execute those operations. The OS (and also the hardware) can interrupt this execution at any time (although not in the middle of a single instruction), save the thread's state, suspend it, and assign some other thread to that core. This is known as a context switch. The OS sometimes does this on so-called syscalls (when a program calls some OS functionality, e.g. asks for access to the disk, the network, etc.) as well. This matters here because mutexes use syscalls under the hood.
So what happens when a thread tries to acquire a locked mutex? First of all, no periodic checks happen. While possible, that would be a waste of CPU cycles, and it is extremely unlikely that any serious OS does that. What actually happens is that each mutex internally has a queue associated with it. When the mutex is locked, the OS adds the current thread to this queue and suspends it. Afterwards the OS assigns some other thread to that CPU core, if one is available.
Now if a mutex is locked, there is a thread that actually locked it. Let's call that thread the owner. The owner is not suspended; it is doing some work. When it finishes whatever it is doing, it has to unlock the mutex (which is a syscall as well), otherwise the pending threads will never resume. When the unlocking happens, the OS looks at the associated queue and picks a thread from it (which one is an implementation detail; it will often be some kind of priority queue). The newly picked thread becomes the new owner of the mutex, and the OS resumes it, meaning it schedules the thread for execution. Schedules, because all cores may be busy at the moment.
Note that this is a brief overview of the topic. There are lots of other things and optimizations in play, like futexes and how to implement thread-safe (or rather core-safe) code without mutexes (mutexes are not hardware features; they are implemented by the OS). But that's more or less how things work.
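To make the "suspend in the kernel" idea concrete, here is a deliberately oversimplified futex-based lock for Linux. It is functional but wasteful (unlock always issues a wake even when there are no waiters); Ulrich Drepper's paper "Futexes Are Tricky" develops a more efficient three-state version:

```c
/* Oversimplified Linux futex lock, for illustration only. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int word;  /* 0 = unlocked, 1 = locked */

static void lock(void) {
    int expected = 0;
    while (!atomic_compare_exchange_strong(&word, &expected, 1)) {
        /* Sleep in the kernel, but only while word is still 1. */
        syscall(SYS_futex, &word, FUTEX_WAIT, 1, NULL, NULL, 0);
        expected = 0;
    }
}

static void unlock(void) {
    atomic_store(&word, 0);
    /* Wake one thread from the kernel's wait queue for this address. */
    syscall(SYS_futex, &word, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```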
Typically the thread will attempt to acquire the mutex, and if it can't (e.g. because another thread already acquired it) it will inform the scheduler and the scheduler will block the waiting thread and switch to a different thread, and then (later, when the lock is released) the scheduler will unblock the waiting thread and give it CPU time again.
On single-CPU systems this is almost required. All CPU time spent between finding out the lock can't be acquired and doing a task switch (e.g. "spinning"/polling the lock again) is wasted CPU time that achieves nothing, because no other thread can release the lock until a task switch occurs.
However, research on multi-CPU systems (which I vaguely remember from about 20 years ago, and which may or may not have been done by Sun for Solaris) indicates that a small amount of "spinning", in the hope that a thread running on a different CPU releases the lock in time, can be beneficial by avoiding the cost of task switches. My intuition is that the "time spent spinning before blocking" should be roughly equal to the cost of a task switch (i.e. if a task switch costs 123 microseconds, it'd probably be worthwhile to spin for 123 microseconds before telling the scheduler to block your thread), but this would depend heavily on the scenario (e.g. how heavily contended the lock is, etc.).
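A sketch of that spin-then-block idea on top of a pthread mutex; SPIN_TRIES is a made-up tuning knob that would ideally correspond to the task-switch cost described above:

```c
/* Sketch: spin briefly in user space, then fall back to blocking. */
#include <pthread.h>

#define SPIN_TRIES 1000  /* hypothetical; tune against task-switch cost */

void spin_then_lock(pthread_mutex_t *m) {
    for (int i = 0; i < SPIN_TRIES; i++)
        if (pthread_mutex_trylock(m) == 0)
            return;            /* acquired without a context switch */
    pthread_mutex_lock(m);     /* give up and let the scheduler block us */
}
```

glibc ships essentially this policy built in as the non-standard PTHREAD_MUTEX_ADAPTIVE_NP mutex type.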
Typically,
The hardware thread (your "CPU") will be switched by the kernel to running a different software thread, and the original software thread will be set aside until the mutex it is waiting on becomes signaled. At that point the kernel places it back among the set of software threads it seeks to schedule for execution on one of the hardware threads in the system.
Your option 1 corresponds to what is called a critical section on Microsoft's platforms and, more generally, to a spinlock. See pthread_spin_lock().
Your option 2 is most similar to what usually happens.
In the Microsoft world, the Mutex is waited on with WaitForSingleObject(), which is described as
If the object's state is nonsignaled, the calling thread enters the wait state until the object is signaled or the time-out interval elapses.
Now you need to know that the "wait state" is a state in which the thread is not active. We call this "blocking", which is the opposite of a busy wait, where CPU time is used.
From that point on, the kernel will immediately give the CPU to another thread and never give it back to your thread unless the Mutex becomes "signaled". So the thread really does use 0 CPU cycles during the wait.
When the kernel notices that the Mutex has changed, it can "wake up" the thread, and it might even boost the thread's priority because it was waiting politely all that time.
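Put together, the pattern being described looks roughly like this (sketch; hMutex assumed to come from CreateMutex(NULL, FALSE, NULL)):

```c
/* Sketch: blocking wait on a Win32 Mutex. The thread consumes no CPU
 * inside WaitForSingleObject until the Mutex is signaled. */
#include <windows.h>

void with_mutex(HANDLE hMutex) {
    if (WaitForSingleObject(hMutex, INFINITE) == WAIT_OBJECT_0) {
        /* ... critical section ... */
        ReleaseMutex(hMutex);  /* signals the Mutex; wakes one waiter */
    }
}
```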
The CPU stays on the current thread and continually checks whether the mutex has been unlocked.
It's not the CPU that picks a thread to execute. The Windows thread scheduler picks which thread gets executed.
If a Mutex could block a CPU that way, you would need only 8 or 12 Mutexes to brick your whole system.
The CPU switches to another thread (or process) for a moment [...]
Almost. There will be an interrupt from a timer. The interrupt will be handled by an interrupt service routine in the Windows kernel, and at that point the kernel can decide which thread to execute next.
[...] then switches back to the original thread and checks periodically.
No. Because the Mutex is a kernel object, the kernel already knows there is no use in letting the thread check again until the Mutex has been signaled.
I am writing an application that uses a third-party library to perform heavy computations.
This library implements parallelism internally and spawns a given number of threads. I want to run several (a dynamic count of) instances of this library, and therefore end up quite heavily oversubscribing the CPU.
Is there any way I can increase the "time quantum" of all the threads in a process so that, e.g., all the threads with normal priority rarely context switch (yield) unless they yield explicitly through e.g. semaphores?
That way I could possibly avoid most of the performance overhead of oversubscribing the CPU. Note that in this case I don't care if a thread is starved for a few seconds.
EDIT:
One complicated way of doing this would be to perform thread scheduling manually:
Enumerate all the threads with a specific priority (e.g. normal).
Suspend all of them.
Create a loop that resumes/suspends the threads every e.g. 40 ms and makes sure no more threads than the current CPU count are running.
Are there any major drawbacks to this approach? I'm not sure what the overhead of suspending/resuming a thread is.
There is nothing special you need to do. Any decent scheduler will not allow unforced context switches to consume a significant fraction of CPU resources. Any operating system that doesn't have a decent scheduler should not be used.
The performance overhead of oversubscribing the CPU is not the cost of unforced context switches. Why? Because the scheduler can simply avoid those: it only performs an unforced context switch when doing so has a benefit. The real performance costs are:
It can take longer to finish a job because more work will be done on other jobs between when the job is started and when the job finishes.
Additional threads consume memory for their stacks and related other tracking information.
More threads generally means more contention (for example, when memory is allocated), which can mean more forced context switches, where a thread has to be switched out because it can't make forward progress.
You only want to try to change the scheduler's behavior when you know something significant that the scheduler doesn't know. There is nothing like that going on here. So the default behavior is what you want.
Are there any major drawbacks to this approach? I'm not sure what the overhead of suspending/resuming a thread is.
Yes. Suspending and resuming threads from user-mode code is a very dangerous activity and should (almost) never be done. Moreover, we should not use these mechanisms to achieve something any modern scheduler already does for us. This is also mentioned in another answer to this question.
The above applies to any operating system, but from the question's tags it appears to be about a Microsoft Windows based system. If we read about SuspendThread() on MSDN, we find the following:
"This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization. Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread".
So consider a scenario in which a thread has acquired some resource implicitly (i.e. not in your code, but via a library or in kernel mode). If we suspend that thread, the result is a mysterious deadlock, because other threads of the process end up waiting for that resource. Since we can never be sure, at any given time, what resources a running thread has acquired, suspending/resuming threads is not a good idea.
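As a hypothetical illustration of the MSDN warning, the sketch below can deadlock whenever SuspendThread happens to catch the worker while it owns the critical section. Whether it does is timing-dependent, which is exactly what makes such bugs mysterious:

```c
/* Sketch: SuspendThread vs. a lock owner. Timing-dependent deadlock. */
#include <windows.h>

CRITICAL_SECTION cs;

DWORD WINAPI worker(LPVOID arg) {
    for (;;) {
        EnterCriticalSection(&cs);
        /* ... work under the lock ... */
        LeaveCriticalSection(&cs);
    }
}

int main(void) {
    InitializeCriticalSection(&cs);
    HANDLE h = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    Sleep(100);
    SuspendThread(h);           /* may freeze the worker while it owns cs */
    EnterCriticalSection(&cs);  /* if so, this blocks forever */
    return 0;
}
```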
Do deadlocks put processes into a high rate of CPU usage, or do the processes both "sleep", each waiting on the other to finish?
I am trying to debug a multithreaded program written in C++ on a Linux system. I have noticed excessive CPU utilization from one particular process and am wondering whether it could be due to a deadlock. I have identified (using top) that one process consistently uses more of the CPU than I would anticipate; the process works, but it works slowly. If deadlocks cause the processes to sleep and do not cause high CPU usage, then at least I know this is not a deadlocking issue.
A deadlock typically does not cause high CPU usage, at least not if the deadlock occurs in synchronization primitives that are backed by the OS such that processes sleep while they wait.
If the deadlock occurs with lockless synchronization mechanisms (such as a compare-exchange in an idle loop), CPU usage will be high.
Also, there is the notion of a livelock, which occurs when a program with multiple threads is unable to advance to some intended state because some condition (that depends on interaction between threads) cannot be fulfilled, even though none of the threads is explicitly waiting for something.
It depends on the type of lock. A lock implemented as a spin loop could drive CPU usage up to 100% in a deadlock situation.
On the other hand, a signaling lock such as a kernel mutex does not consume CPU cycles while waiting, so a deadlock on such a lock would not peg the CPU at 100%.
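For a concrete picture of the difference, here is a sketch of a lockless-style circular wait: each thread spins on a flag that only the other thread would clear, so both cores sit near 100%. The same circular wait built on pthread mutexes would show roughly 0% CPU instead:

```c
/* Sketch: two threads deadlocked in busy-wait loops, pegging two cores. */
#include <stdatomic.h>

atomic_int flag_a = 1, flag_b = 1;

void *thread_a(void *arg) {
    while (atomic_load(&flag_b)) { /* spin: burns a full core */ }
    atomic_store(&flag_a, 0);      /* never reached */
    return arg;
}

void *thread_b(void *arg) {
    while (atomic_load(&flag_a)) { /* spin: burns a full core */ }
    atomic_store(&flag_b, 0);      /* never reached */
    return arg;
}
```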
I'm using pthreads on Linux. I have a circular buffer to pass data from one thread to another. Maybe a circular buffer is not the best structure to use here, but changing that would not make my problem go away, so we'll just refer to it as a queue.
Whenever my queue is either full or empty, pop/push operations return NULL. This is problematic since my threads fire periodically; waiting for another thread loop would take too long.
I've tried using semaphores (sem_post, sem_wait), but unlocking under contention takes up to 25 ms, which is about the period of my loop. I've tried waiting with pthread_cond_t, but the unlocking takes between 10 and 15 ms.
Is there a faster mechanism I could use to wait for data?
EDIT:
OK, I used condition variables. I'm on an embedded device, so adding "more cores or CPU power" is not an option. This made me realise I had all sorts of thread priorities set all over the place, so I'll sort that out before going further.
You should use condition variables. The only faster ways are platform-specific, and they're only negligibly faster.
You're seeing what you think is poor performance simply because your threads are being de-scheduled. You see long "delays" when your thread is near the end of its timeslice and the scheduler allows the newly unblocked thread to preempt the running thread. If you have more cores than threads, or set your thread to a higher priority, you won't see these delays.
But these delays are actually a good thing, and you shouldn't be concerned about them. Other threads just get a chance to run too.
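Since the accepted advice here is condition variables, a minimal sketch of a condvar-guarded ring buffer follows; the names and capacity are illustrative. q_pop blocks at 0% CPU until a q_push signals that data is available:

```c
/* Sketch: bounded queue where consumers block on a condition variable. */
#include <pthread.h>

#define CAP 64  /* illustrative capacity */

typedef struct {
    void *buf[CAP];
    int head, tail, count;
    pthread_mutex_t mtx;
    pthread_cond_t not_empty;
} queue_t;

int q_push(queue_t *q, void *item) {
    pthread_mutex_lock(&q->mtx);
    if (q->count == CAP) {               /* full: caller decides what to do */
        pthread_mutex_unlock(&q->mtx);
        return -1;
    }
    q->buf[q->tail] = item;
    q->tail = (q->tail + 1) % CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);  /* wake one blocked consumer */
    pthread_mutex_unlock(&q->mtx);
    return 0;
}

void *q_pop(queue_t *q) {
    pthread_mutex_lock(&q->mtx);
    while (q->count == 0)                /* loop guards spurious wakeups */
        pthread_cond_wait(&q->not_empty, &q->mtx);
    void *item = q->buf[q->head];
    q->head = (q->head + 1) % CAP;
    q->count--;
    pthread_mutex_unlock(&q->mtx);
    return item;
}
```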