https://www.linkedin.com/pulse/how-do-i-design-high-frequency-trading-systems-its-part-silahian-2/
Avoiding cache misses and CPU context switching
How does the busy-wait/spinning pattern avoid context switches if it runs two threads on one core?
There will still be context switches between these two threads (the producer thread and one consumer thread), right?
What are the consequences if I don't set thread affinity to one specific core?
I completely get why it avoids cache misses, but I am still having trouble understanding how it avoids context switches.
How does the busy-wait/spinning pattern avoid context switches if it runs two threads on one core?
When a thread tries to take a lock that is already held by another thread, it makes a system call so the OS knows the thread is waiting for the lock and that it would be pointless to let it continue executing. This causes a system call and a context switch (because the OS will try to execute another thread on the processing unit), both of which are expensive.
Using a spin lock essentially lies to the OS by not telling it that the thread is waiting, while this is actually the case. As a result, the OS will wait until the end of the quantum (the time slice allocated to the thread) before doing a context switch. A quantum is generally pretty big (e.g. 8 ms), so the overhead of context switches does not seem to be a problem in this case. However, it is a problem if you care about latency: if another thread also uses a spin lock, execution becomes very inefficient, because each thread will run for its full quantum, and the average latency will be half the quantum, which is generally far more than the overhead of a context switch. To avoid this, you should make sure there are more cores than actively running threads, and you should control the environment. Otherwise, spin locks are actually more harmful than context switches. If you do not control the environment, spin locks are generally not a good idea.
If 2 threads are running on the same core and 2-way SMT is enabled, then each thread will likely execute on one of the two hardware threads. In that case a spin lock can significantly slow down the other thread while accomplishing nothing. The x86-64 pause instruction can be used to tell the processor that the thread is spinning and to let the other thread of the same core execute; it also helps reduce contention. Note that even with the pause instruction, a modern processor will run at full speed (possibly at turbo frequency), consuming a lot of energy (and thus producing heat).
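For illustration, here is a minimal sketch of a test-and-test-and-set spin lock that issues pause while waiting; the class name and structure are my own, not from the answer above:

    #include <atomic>
    #include <immintrin.h>  // _mm_pause (x86-64)

    // Minimal spin-lock sketch. The inner read-only loop keeps the cache
    // line in shared state instead of hammering it with atomic writes,
    // and _mm_pause tells the core we are spin-waiting, letting the
    // sibling SMT thread make progress.
    class SpinLock {
        std::atomic<bool> locked_{false};
    public:
        void lock() {
            for (;;) {
                if (!locked_.exchange(true, std::memory_order_acquire))
                    return;  // grabbed the lock
                while (locked_.load(std::memory_order_relaxed))
                    _mm_pause();  // spin-wait hint to the processor
            }
        }
        void unlock() { locked_.store(false, std::memory_order_release); }
    };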
What are the consequences if I don't set thread affinity to one specific core?
Threads can then move between cores. If a thread moves from one core to another, it generally needs to reload data into that core's cache (typically filling the L2 from the L3 or from another core's L2) via cache misses that may land on the critical path of a latency-critical operation. In the worst cases, a thread can move from one NUMA node to another. Data transfers between NUMA nodes are generally slower, and the core will then be accessing data owned by the memory of another NUMA node, which is more expensive. Besides, migration can increase the overall cost of context switches.
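On Linux, affinity can be pinned with pthread_setaffinity_np(); a minimal sketch (the core number 2 is an arbitrary choice for the example):

    #define _GNU_SOURCE  // needed for pthread_setaffinity_np on glibc
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Pin the calling thread to a single core so it cannot migrate.
    static void pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (rc != 0)
            std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    }

    int main() {
        pin_to_core(2);  // from here on, the scheduler keeps this thread on core 2
        // ... latency-critical work ...
    }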
Related
I'm curious about the behavior of the CPU while a thread is waiting for a mutex. I can imagine two possibilities:
The CPU stays on the current thread and continually checks whether the mutex has been unlocked.
The CPU switches to another thread (or process) for a while, then periodically switches back to the original thread to check.
Which one is right, or does the STL implement it in another way?
To understand this, you first need to understand the difference between a thread and a CPU core. A thread is an abstract thing, a data structure, used to represent a sequence of operations to be executed. The OS assigns threads to CPU cores, and those cores then execute those operations. The OS (and also the hardware) can interrupt this execution at any time (although not in the middle of a single instruction), save the thread's state, suspend it, and assign some other thread to that core. This is known as a context switch. The OS sometimes does this on so-called syscalls (when a program calls some OS functionality, e.g. asks for access to disk, network, etc.) as well. This matters because mutexes use syscalls under the hood.
So what happens when a thread tries to acquire a locked mutex? First of all, no periodic checks happen. While possible, that would be a waste of CPU cycles, and it is extremely unlikely that any serious OS does it. What actually happens is that each mutex internally has a queue associated with it. When the mutex is locked, the OS adds the current thread to this queue and suspends it. Afterwards the OS assigns some other thread to the CPU core, if one is available.
Now if a mutex is locked, then there is a thread that actually locked it. Let's call that thread the owner. The owner is not suspended; it is doing some work. When it finishes whatever it is doing, it has to unlock the mutex (which is a syscall as well), otherwise the pending threads will never resume. When the unlocking happens, the OS looks at the associated queue and picks a thread from it (which one is an implementation detail; it will often be some priority queue). The newly picked thread becomes the new owner of the mutex, and the OS resumes it, meaning it schedules the thread for execution. Schedules, because all cores may be busy at the moment.
Note that this is a brief overview of the topic. There are lots of other things and optimizations in play, like futexes, and ways to implement thread-safe (or rather core-safe) code without mutexes (mutexes are not hardware features; they are implemented in the OS). But that's more or less how things work.
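To make the queue-and-suspend idea concrete, here is a deliberately simplified Linux futex-based lock. It is a sketch only: it assumes std::atomic<int> has the layout of a plain int (true on Linux/x86-64), and a production futex mutex needs more states and care (see Drepper's "Futexes Are Tricky"):

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // 0 = unlocked, 1 = locked. Simplified: unlock always issues a wake.
    static std::atomic<int> state{0};

    static long futex(int* addr, int op, int val) {
        return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
    }

    void lock() {
        // Try to grab the lock; if it is taken, ask the kernel to suspend
        // us until another thread wakes us (the "queue" from the answer).
        while (state.exchange(1, std::memory_order_acquire) == 1)
            futex(reinterpret_cast<int*>(&state), FUTEX_WAIT, 1);
    }

    void unlock() {
        state.store(0, std::memory_order_release);
        // Wake one waiter queued on this address, if any.
        futex(reinterpret_cast<int*>(&state), FUTEX_WAKE, 1);
    }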
Typically the thread will attempt to acquire the mutex, and if it can't (e.g. because another thread already acquired it), it will inform the scheduler; the scheduler will block the waiting thread and switch to a different thread, and later (when the lock is released) the scheduler will unblock the waiting thread and give it CPU time again.
On single-CPU systems, this is almost required. All CPU time spent (e.g. "spinning"/polling the lock again) between finding out the lock can't be acquired and doing a task switch (to a thread that may release the lock) is wasted CPU time that achieves nothing, because no other thread can release the lock until a task switch occurs.
However, research on multi-CPU systems (that I vaguely remember from about 20 years ago, and that may or may not have been done by Sun for Solaris) indicates that a small amount of "spinning" (in the hope that a thread running on a different CPU releases the lock in time) can be beneficial by avoiding the cost of task switches. My intuition is that the time spent spinning before blocking should be roughly equal to the cost of a task switch (if a task switch costs 123 microseconds, it is probably worthwhile to spin for 123 microseconds before telling the scheduler to block your thread); but this depends heavily on the scenario (e.g. how heavily contended the lock is, etc.).
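A sketch of this "spin a bit, then block" idea, layered over a std::mutex as the blocking fallback; the spin budget of 4096 iterations is an arbitrary stand-in for the measured task-switch cost:

    #include <mutex>
    #include <immintrin.h>  // _mm_pause

    // Spin-then-block sketch: try the lock a bounded number of times
    // (roughly "one task switch worth" of spinning), then let the
    // scheduler block us. kSpinBudget is an arbitrary example value.
    constexpr int kSpinBudget = 4096;

    class HybridLock {
        std::mutex m_;
    public:
        void lock() {
            for (int i = 0; i < kSpinBudget; ++i) {
                if (m_.try_lock())
                    return;        // got it while spinning: no task switch
                _mm_pause();       // relax the core between attempts
            }
            m_.lock();             // give up spinning; block in the kernel
        }
        void unlock() { m_.unlock(); }
    };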
Typically,
The hardware thread (your "CPU") will be switched by the kernel to running a different software thread, and the original software thread will be set aside until the mutex it is waiting on becomes signaled. At that point the kernel will place it among the set of software threads that it seeks to schedule for execution on one of the hardware threads in the system.
Your option 1 corresponds to what is called a critical section on Microsoft's platforms and, more generally, to a spinlock. See pthread_spin_lock().
Your option 2 is most similar to what usually happens.
In the Microsoft world, the Mutex is waited on with WaitForSingleObject(), which is described as follows:
If the object's state is nonsignaled, the calling thread enters the wait state until the object is signaled or the time-out interval elapses.
Now you need to know that the "wait state" is a state in which the thread is not active. We call this "blocking", which is the opposite of a busy wait, where CPU time is used.
As soon as the wait begins, the kernel will give the CPU to another thread and will never give it back to your thread unless the Mutex becomes "signaled". So the thread really uses 0 CPU cycles during the wait.
When the kernel notices that the Mutex has changed, it can "wake up" the thread, and it might even boost the thread's priority because it was waiting patiently all that time.
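For reference, the typical Win32 pattern looks like this (a minimal sketch; error handling is mostly omitted):

    #include <windows.h>

    int main() {
        // Unnamed mutex, not initially owned by the creator.
        HANDLE hMutex = CreateMutexW(nullptr, FALSE, nullptr);
        if (hMutex == nullptr)
            return 1;

        // Blocks with 0 CPU usage until the mutex is signaled (released).
        DWORD wait = WaitForSingleObject(hMutex, INFINITE);
        if (wait == WAIT_OBJECT_0) {
            // ... critical section ...
            ReleaseMutex(hMutex);  // signal the mutex so a waiter can wake
        }

        CloseHandle(hMutex);
        return 0;
    }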
The CPU stays on the current thread and continually checks whether the mutex has been unlocked.
It's not the CPU that picks a thread to be executed; the Windows thread scheduler picks the thread that gets executed.
If a Mutex could tie up a CPU that way, you would need only 8 or 12 Mutexes (one per core) to fully brick your system.
The CPU switches to another thread (or process) for a while [...]
Almost. There will be an interrupt from a timer. The interrupt will be handled by an interrupt service routine of the Windows kernel. At that point, the kernel can decide which thread will be executed next.
[...] then periodically switches back to the original thread to check.
No. Because the Mutex is a kernel object, the kernel already knows there is no use in letting the thread check again unless the Mutex has been signaled.
I am writing an application that uses a third-party library to perform heavy computations. This library implements parallelism internally and spawns a given number of threads. I want to run several instances of this library (the count is dynamic), and therefore end up quite heavily oversubscribing the CPU.
Is there any way I can increase the "time quantum" of all the threads in a process so that e.g. all the threads with normal priority rarely context switch (yield) unless they are explicitly yielded through e.g. semaphores?
That way I could possibly avoid most of the performance overhead of oversubscribing the cpu. Note that in this case I don't care if a thread is starved for a few seconds.
EDIT:
One complicated way of doing this is to perform thread scheduling manually (see the sketch below):
Enumerate all the threads with a specific priority (e.g. normal).
Suspend all of them.
Create a loop which resumes/suspends the threads every e.g. 40 ms and makes sure that no more threads than the current CPU count are running.
Any major drawbacks with this approach? I'm not sure what the overhead of suspending/resuming a thread is.
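A sketch of that manual-scheduling loop using the Win32 suspend/resume APIs; the helper name is my own, the threads are assumed to start suspended (e.g. created with CREATE_SUSPENDED), and the answers below explain why this is dangerous in practice:

    #include <windows.h>
    #include <vector>

    // Round-robin "manual scheduler" sketch: keep at most `cores` of the
    // given threads resumed, rotating every 40 ms. Only the currently
    // running slice is suspended, so suspend counts stay balanced.
    // Dangerous in practice: SuspendThread can freeze a thread while it
    // holds a lock (see the answers below).
    void manual_schedule(std::vector<HANDLE>& threads, size_t cores) {
        size_t next = 0;
        std::vector<HANDLE> running;
        for (;;) {
            for (HANDLE h : running)          // stop the current slice
                SuspendThread(h);
            running.clear();
            for (size_t i = 0; i < cores && i < threads.size(); ++i) {
                running.push_back(threads[next]);
                ResumeThread(threads[next]);  // wake the next slice
                next = (next + 1) % threads.size();
            }
            Sleep(40);                        // the 40 ms "quantum"
        }
    }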
There is nothing special you need to do. Any decent scheduler will not allow unforced context switches to consume a significant fraction of CPU resources. Any operating system that doesn't have a decent scheduler should not be used.
The performance overhead of oversubscribing the CPU is not the cost of unforced context switches. Why? Because the scheduler can simply avoid those. The scheduler only performs an unforced context switch when that has a benefit. The performance costs are:
It can take longer to finish a job because more work will be done on other jobs between when the job is started and when the job finishes.
Additional threads consume memory for their stacks and related other tracking information.
More threads generally means more contention (for example, when memory is allocated) which can mean more forced context switches where a thread has to be switched out because it can't make forward progress.
You only want to try to change the scheduler's behavior when you know something significant that the scheduler doesn't know. There is nothing like that going on here. So the default behavior is what you want.
Any major drawbacks with this approach? I'm not sure what the overhead of suspending/resuming a thread is.
Yes. Suspending/resuming threads from user-mode code is a very dangerous activity and should (almost) never be done. Moreover, we should not use these mechanisms to achieve something that any modern scheduler already does for us. This is also mentioned in the other answer to this question.
The above applies to any operating system, but from the question's tags it appears to be about a Microsoft Windows based system. If we read about SuspendThread() on MSDN, we find the following:
"This function is primarily designed for use by debuggers. It is not intended to be used for thread synchronization. Calling SuspendThread on a thread that owns a synchronization object, such as a mutex or critical section, can lead to a deadlock if the calling thread tries to obtain a synchronization object owned by a suspended thread".
So consider the scenario in which a thread has acquired some resource implicitly (i.e., not in your code, but in a library or in kernel mode); if we suspend that thread, the result can be a mysterious deadlock, because other threads of the process will wait for that particular resource. Since we can never be sure, at any point in our program, what resources a running thread has acquired, suspending/resuming threads is not a good idea.
Do deadlocks put processes into a high rate of CPU usage, or do these two processes both "sleep", waiting on the other to finish?
I am trying to debug a multithreaded program written in C++ on a Linux system. I have noticed excessive CPU utilization from one particular process and am wondering if it could be due to a deadlock issue. I have identified (using top) that this process consistently uses more of the CPU than I would anticipate; the process works, but it works slowly. If deadlocks cause the processes to sleep and do not cause high CPU usage, then at least I know this is not a deadlocking issue.
A deadlock typically does not cause high CPU usage, at least not if the deadlock occurs in synchronization primitives that are backed by the OS such that processes sleep while they wait.
If the deadlock occurs with lockless synchronization mechanisms (e.g. compare-exchange in an idle loop), CPU usage will be high.
Also, there is the notion of a livelock, which occurs when a program with multiple threads is unable to advance to some intended state because some condition (that depends on interaction between threads) cannot be fulfilled, even though none of the threads is explicitly waiting for something.
It depends on the type of lock. A lock implemented as a spin loop can drive CPU usage to 100% in a deadlock situation.
On the other hand, a signalling lock such as a kernel mutex does not consume CPU cycles while waiting, so a deadlock on such a lock would not peg the CPU at 100%.
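For instance, the classic lock-ordering deadlock below hangs with blocking std::mutexes; in top, both threads sit at ~0% CPU (swap the mutexes for spin locks and they would pin two cores instead):

    #include <chrono>
    #include <mutex>
    #include <thread>

    std::mutex a, b;

    int main() {
        // Two threads take the same two locks in opposite orders: a
        // classic deadlock. With blocking mutexes, both threads sleep
        // in the kernel, so the process shows ~0% CPU.
        std::thread t1([] {
            std::lock_guard<std::mutex> la(a);
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            std::lock_guard<std::mutex> lb(b);  // blocks forever
        });
        std::thread t2([] {
            std::lock_guard<std::mutex> lb(b);
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            std::lock_guard<std::mutex> la(a);  // blocks forever
        });
        t1.join();  // never returns
        t2.join();
    }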
I'm using busy waiting to synchronize access to critical regions, like this:
    while (p1_flag != T_ID) ;   /* busy-wait until it is this thread's turn */
    /* begin: critical section */
    for (int i = 0; i < N; i++) {
        ...
    }
    /* end: critical section */
    p1_flag++;                  /* pass the turn to the other thread */
p1_flag is a global volatile variable that is updated by the other concurrent thread. In fact, I have two critical sections inside a loop, and two threads (both executing the same loop) that alternate execution of these critical regions. With the critical regions named A and B, the interleaving looks like this:
    Thread 1    Thread 2
    A
    B           A
    A           B
    B           A
    A           B
    B           A
                B
The parallel code executes faster than the serial one, but not as much as I expected. Profiling the parallel program with VTune Amplifier, I noticed that a large amount of time is being spent in the synchronization directives, that is, in the while(...) loop and the flag update. I'm not sure why I see such a large overhead on these "instructions", since region A is exactly the same as region B. My best guess is that this is due to cache coherence latency: I'm using an Intel Ivy Bridge i7 machine, and this microarchitecture resolves cache coherence at the L3. VTune also reports that the while(...) instruction is consuming all the front-end bandwidth, but why?
To make the question(s) clear: why are the while(...) and flag-update instructions taking so much execution time? Why would the while(...) instruction saturate the front-end bandwidth?
The overhead you're paying may very well be due to passing the sync variable back and forth between the core caches.
Cache coherency dictates that when you modify a cache line (p1_flag++) you need to have ownership of it. This means the write would invalidate any copy existing in other cores, waiting for any changes made by another core to be written back to a shared cache level. The line is then provided to the requesting core in Modified (M) state and the modification is performed.
However, the other core would by then be constantly reading this line, reads that would snoop the first core and ask whether it has a copy of the line. Since the first core holds an M copy of the line, the line gets written back to the shared cache and the first core loses ownership.
Now, this depends on the actual HW implementation, but if the line was snooped before the change was actually made, the first core would have to try to gain ownership of it again. In some cases I'd imagine this might take several iterations of attempts.
If you're set on using a busy wait, you should at least use a pause inside it: the _mm_pause intrinsic, or just __asm("pause"). This would both give the other thread a chance to get the lock and release you from waiting, and reduce the CPU effort spent busy waiting (an out-of-order CPU would fill all its pipelines with parallel instances of this busy wait, consuming lots of power; a pause serializes it so only a single iteration can run at any given time, which consumes much less power with the same effect).
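Applied to the question's loop, that would look roughly like this. A sketch only: it keeps the question's turn-passing scheme, and it also makes the flag std::atomic rather than volatile, since in C++11 volatile alone does not guarantee inter-thread ordering:

    #include <atomic>
    #include <immintrin.h>  // _mm_pause

    std::atomic<int> p1_flag{0};  // replaces the volatile global

    void critical_region(int T_ID) {
        // Busy-wait for our turn, but relax the core on every iteration.
        while (p1_flag.load(std::memory_order_acquire) != T_ID)
            _mm_pause();
        /* begin: critical section */
        // ... work ...
        /* end: critical section */
        p1_flag.fetch_add(1, std::memory_order_release);  // pass the turn
    }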
A busy-wait is almost never a good idea in multithreaded applications.
When you busy-wait, thread scheduling algorithms have no way of knowing that your loop is waiting on another thread, so they must allocate time as if your thread were doing useful work. And it does take processor time to check that variable over, and over, and over, and over... until it is finally "unlocked" by the other thread. In the meantime, your other thread will be preempted by your busy-waiting thread again and again, for no purpose at all.
This is an even worse problem if the scheduler is priority-based and the busy-waiting thread has the higher priority. In this situation, the lower-priority thread will NEVER preempt the higher-priority thread, so you have a deadlock situation.
You should ALWAYS use semaphores or mutex objects or messaging to synchronize threads. I've never seen a situation where a busy-wait was the right solution.
When you use a semaphore or mutex, then the scheduler knows never to schedule that thread until the semaphore or mutex is released. Thus your thread will never be taking time away from threads that do real work.
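As a concrete alternative to the busy-wait above, the same turn-taking can be expressed with a mutex and condition variable, so the waiting thread sleeps instead of spinning (a sketch with names of my own choosing):

    #include <condition_variable>
    #include <mutex>

    std::mutex m;
    std::condition_variable cv;
    int turn = 0;  // whose turn it is, analogous to p1_flag

    void enter_region(int my_id) {
        std::unique_lock<std::mutex> lk(m);
        // Sleeps (0% CPU) until notified AND it is our turn.
        cv.wait(lk, [my_id] { return turn == my_id; });
    }

    void leave_region() {
        {
            std::lock_guard<std::mutex> lk(m);
            ++turn;            // pass the turn, as p1_flag++ did
        }
        cv.notify_all();       // wake waiters so they can re-check the turn
    }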
I have 4 processes sharing a common semaphore; all the processes have the same priority. The critical region inside the lock has read/write operations, including an fflush() call.
In the logs, I observed that after a particular process gives up the semaphore, it takes the other processes a considerable amount of time to acquire the lock.
Since all 4 processes block at the same point, this is a performance issue on an embedded device. If the lock were shared between threads, pthread_cond_t could be used to minimize the switching time. Now, what can be done to minimize the switching time between processes?
Context switches between processes happen inside the kernel. It is the job of the kernel scheduler to do the context switching, so you can't do much here other than trying to speed up the scheduler's context-switching path. A better alternative might be to investigate the problem and improve your app by reducing lock contention (perhaps).
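If the synchronization can be restructured, one option on POSIX systems is a process-shared mutex placed in shared memory; a minimal sketch, assuming the processes are forked from a common parent (error handling omitted):

    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        // Anonymous shared mapping visible to the parent and its children.
        auto* mtx = static_cast<pthread_mutex_t*>(
            mmap(nullptr, sizeof(pthread_mutex_t), PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0));

        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        // Mark the mutex usable across processes, not just threads.
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(mtx, &attr);

        if (fork() == 0) {              // child: same lock protocol as parent
            pthread_mutex_lock(mtx);
            /* critical region: read/write + fflush() */
            pthread_mutex_unlock(mtx);
            _exit(0);
        }

        pthread_mutex_lock(mtx);
        /* critical region */
        pthread_mutex_unlock(mtx);
        return 0;
    }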