should I "bind" "spinning" thread to the certain core? - c++

My application contains several latency-critical threads that "spin", i.e. never block.
Such a thread is expected to take 100% of one CPU core. However, it seems modern operating systems often move threads from one core to another. So, for example, with this Windows code:
void Processor::ConnectionThread()
{
    while (work)
    {
        Iterate();
    }
}
I do not see a "100% occupied" core in Task Manager; the overall system load is 36-40%.
But if I change it to this:
void Processor::ConnectionThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 2);
    while (work)
    {
        Iterate();
    }
}
Then I do see that one of the CPU cores is 100% occupied, and the overall system load is reduced to 34-36%.
Does that mean I should prefer SetThreadAffinityMask for "spin" threads? Did I actually improve latency by adding SetThreadAffinityMask in this case? What else should I do for "spin" threads to improve latency?
I'm in the middle of porting my application to Linux, so this question is more about Linux if this matters.
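For reference, a rough Linux counterpart of the SetThreadAffinityMask call above might look like the following (a minimal sketch; the CPU index and the lack of error handling are illustrative only):

    #define _GNU_SOURCE
    #include <sched.h>

    // Pin the calling thread to one logical CPU. A mask of 2 in the Windows
    // snippet above corresponds to CPU index 1 here. pid 0 = current thread.
    void pin_current_thread_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);   // check the return value in real code
    }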
Update: I found this slide, which shows that binding a busy-waiting thread to a CPU may help:

Running a thread locked to a single core gives the best latency for that thread in most circumstances, if that is the most important thing in your code.
The reasons (R) are:
your code is likely to be in your iCache
the branch predictors are tuned to your code
your data is likely to be ready in your dCache
the TLB points to your code and data.
Unless
You're running an SMT system (e.g. hyper-threaded), in which case the evil twin will "help" you by washing your code out of the iCache, tuning the branch predictors to its code, pushing your data out of the dCache, and polluting your TLB.
The cost is unknown; each cache miss costs roughly ~4 ns, ~15 ns or ~75 ns depending on the level, and this quickly adds up to several 1000 ns.
Every reason (R) mentioned above that still holds keeps saving you that cost.
If the evil twin also just spins, the cost should be much lower.
Or you're allowing interrupts on your core, in which case you get the same problems and:
your TLB is flushed
you take a 1000 ns-20000 ns hit on the context switch; most should be at the low end if the drivers are well written.
Or you allow the OS to switch your process out, in which case you have the same problems as with interrupts, just at the high end of the range.
Switching out could also cause the thread to pause for an entire time slice, as it can only be run on one (or two) hardware threads.
Or you use any system calls that cause context switches.
No disk IO at all.
only async IO otherwise.
Having more active (non-paused) threads than cores increases the likelihood of problems.
So if you need less than 100ns latency to keep your application from exploding you need to prevent or lessen the impact of SMT, interrupts and task switching on your core.
The perfect solution would be a real-time operating system with static scheduling. That is a nearly perfect match for your target, but it's a new world if you have mostly done server and desktop programming.
The disadvantages of locking a thread to a single core are:
It will cost some total throughput.
since some threads that might otherwise have run will not, because the context cannot be switched.
but the latency is more important in this case.
If the thread gets context-switched out, it will take some time before it can be scheduled again, potentially one or more time slices, typically 10-16 ms, which is unacceptable in this application.
Locking it to a core and its SMT sibling will lessen this problem, but not eliminate it. Each added core will lessen the problem further.
Setting its priority higher will lessen the problem, but not eliminate it.
Scheduling with SCHED_FIFO at the highest priority will prevent most context switches; interrupts can still cause temporary switches, as do some system calls.
If you have a multi-CPU setup, you might be able to take exclusive ownership of one of the CPUs through cpuset. This prevents other applications from using it.
Using pthread_setschedparam with SCHED_FIFO and the highest priority while running as superuser, and locking the thread to the core and its evil twin, should secure the best latency of all of these; only a real-time operating system can eliminate all context switches.
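A minimal sketch of that combination on Linux, assuming the process has the required privileges (root or CAP_SYS_NICE) and that the target core index is known:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one core and give it SCHED_FIFO at the highest
    // priority. Returns false if either step fails (e.g. missing privileges).
    bool make_realtime_on_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
            return false;

        sched_param sp{};
        sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
        return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) == 0;
    }

Note that with default kernel settings, RT throttling (kernel.sched_rt_runtime_us) still reserves a small share of each core for non-RT tasks, so even a SCHED_FIFO spinner can be briefly preempted.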
Other links:
Discussion on interrupts.
Your Linux might accept that you call sched_setscheduler using SCHED_FIFO, but this demands that you have your own PID, not just a TID, or that your threads are cooperatively multitasking.
This might not be ideal, as all your threads would then only be switched "voluntarily", removing flexibility from the kernel to schedule them.
Interprocess communication in 100ns

Pinning a task to a specific processor will generally give better performance for the task. But there are a lot of nuances and costs to consider when doing so.
When you force affinity, you restrict the operating system's scheduling choices. You increase cpu contention for the remaining tasks. So EVERYTHING else on the system is impacted including the operating system itself. You also need to consider that if tasks need to communicate across memory, and affinities are set to cpus that don't share cache, you can drastically increase latency for communication across tasks.
One of the biggest reasons setting task cpu affinity is beneficial, though, is that it gives more predictable cache and tlb (translation lookaside buffer) behavior. When a task switches cpus, the operating system can switch it to a cpu that doesn't have access to the last cpu's cache or tlb. This can increase cache misses for the task. It's particularly an issue when communicating across tasks, as it takes more time to communicate across higher-level caches and, worst of all, memory. To measure cache statistics on linux (and performance in general) I recommend using perf.
The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency would be by using the rdtsc instruction (at least on x86). This reads the cpu's time source, which will generally give the highest precision. Measuring across events will give roughly nanosecond accuracy.
#include <cstdint>

// Read the CPU's time-stamp counter (the raw bytes 0x0f, 0x31 encode the
// x86 RDTSC instruction).
uint64_t rdtsc() {
    uint32_t eax, edx;
    asm volatile (".byte 0x0f, 0x31" : "=d"(edx), "=a"(eax) : : );
    return ((uint64_t) edx << 32) | (uint64_t) eax;
}
Note: the rdtsc instruction needs to be combined with a load fence to ensure all previous instructions have completed (or use rdtscp).
Also note: if rdtsc is used without an invariant time source (on Linux, check with grep constant_tsc /proc/cpuinfo), you may get unreliable values across frequency changes and if the task switches CPU (time source).
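For example, a measurement helper built on the rdtsc() above might look like this (a sketch only; Iterate() is just a placeholder for whatever code path you are timing):

    #include <cstdint>
    #include <immintrin.h>   // _mm_lfence

    void Iterate();          // placeholder for the code path being measured

    // Returns the elapsed time of one Iterate() call in TSC ticks; convert to
    // nanoseconds using the measured TSC frequency of your machine.
    uint64_t time_iterate_once() {
        _mm_lfence();                   // serialize: earlier instructions must finish
        const uint64_t start = rdtsc();
        Iterate();
        _mm_lfence();
        const uint64_t end = rdtsc();
        return end - start;
    }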
So, in general, yes, setting the affinity does give lower latency, but this is not always true, and there are very serious costs when you do it.
Some additional reading...
Intel 64 Architecture Processor Topology Enumeration
What Every Programmer Needs to Know About Memory (Parts 2, 3, 4, 6, and 7)
Intel Software Developer Reference (Vol. 2A/2B)
Acquire and Release Fences
TCMalloc

I came across this question because I'm dealing with exactly the same design problem. I'm building HFT systems where every nanosecond counts.
After reading all the answers, I decided to implement and benchmark 4 different approaches:
busy wait with no affinity set
busy wait with affinity set
observer pattern
signals
The undisputed winner was "busy wait with affinity set". No doubt about it.
Now, as many have pointed out, make sure to leave a couple of cores free in order to allow the OS to run freely.
My only concern at this point is whether there is any physical harm to cores that run at 100% for hours.

Binding a thread to a specific core is probably not the best way to get the job done. You can do that; it will not harm a multi-core CPU.
The best way to reduce latency is to raise the priority of the process and the polling thread(s). Normally the OS will interrupt your threads hundreds of times a second and let other threads run for a while. Your thread may not run for several milliseconds.
Raising the priority will reduce the effect (but not eliminate it).
Read more about SetThreadPriority and SetProcessPriorityBoost.
There are some details in the docs you need to understand.
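A minimal sketch of what that might look like (the specific priority class and level here are illustrative choices; REALTIME_PRIORITY_CLASS in particular can starve the rest of the system if misused):

    #include <windows.h>

    // Raise the priority of the process and of the current (polling) thread.
    void raise_polling_priority()
    {
        SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
        SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
        // Optionally disable dynamic priority boosting so behavior stays predictable.
        SetProcessPriorityBoost(GetCurrentProcess(), TRUE);
    }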

This is simply foolish. All it does is reduce the scheduler's flexibility. Whereas before it could run it on whatever core it thought was best, now it can't. Unless the scheduler was written by idiots, it would only move the thread to a different core if it had a good reason to do that.
So you're just saying to the scheduler, "even if you have a really good reason to do this, don't do it anyway". Why would you say that?

Related

How to multithread core-schedule onto different cores (ideally in C++)

I have a large C++11 multithreaded application where the threads are always active, communicating to each other constantly, and should be scheduled on different physical CPUs for reasonable performance.
The default Linux behavior AFAIK is that threads will typically/often get scheduled onto the same CPU, causing horrible performance.
To solve this, I understand how to attach threads to specific physical CPUs in C++, e.g.:
std::cout << "Assign to thread cpu " << cpu << "\n";
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
int rc = pthread_setaffinity_np(thread.native_handle(), sizeof(cpu_set_t), &cpuset);
and can use this to pin to specific CPUs, e.g. attach 4 threads to CPUs 0,2,4,6.
However, this approach requires a specific CPU number, which is a problem in that there may be many programs running on a host using other CPUs. These might be copies of my program or other programs. As just one example, an 8-core machine might run two copies of my 4-threaded application, so obviously having both of those programs pick the same 4 CPUs is a problem.
I'd thus like a way to say "schedule the threads in this set all on different CPUs without caring about the CPU number". Is this possible in C++(11)?
If not, is this possible with numactl or another utility? E.g. I don't want "numactl -C 0,2,4,6" but rather "numactl -C W,X,Y,Z" where the scheduler can pick arbitrary W,X,Y,Z subject to W!=X!=Y!=Z.
I'm most interested in Linux behavior. I cannot change the OS configuration. I don't want the separate applications to cross communicate (nor can they as they might be other applications I do not control.)
Once I have the answer to this, the follow-up is: how do I modify this to add, e.g., a fifth thread that I do want to schedule on the same CPU as the first thread?
My problem in a specific Boost ASIO multithreaded application is that even with a limited number of threads (like ten) on a system with many more cores, the threads get pushed around onto different cores all the time, which seriously reduces performance due to a high number of L1/L2 cache misses.
I have not searched much yet, but there is a getcpu() system call on Linux that returns the CPU-ID and NUMA node-ID of the calling thread. To get a set of unique CPU-IDs, one could try to create all threads first, then let them all wait for a barrier via pthread_barrier_wait(), and after that call getcpu() repeatedly in each thread until the returned values have stabilized. Stability has been reached when each thread has gotten the same CPU-ID as the answer for at least the last 1000 calls to getcpu() AND all the answers for the different threads are different. It is of extreme importance to use non-blocking techniques like std::atomic values to synchronize during this testing phase, because if you wait on some mutexes instead, the likelihood is high that your threads get re-mixed again by the scheduler.
After stability has been reached, each thread just sets its CPU affinity to its current CPU-ID and you are done.
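A rough sketch of that per-thread stabilization step, using glibc's sched_getcpu() rather than raw getcpu() (the uniqueness check across threads and the barrier are left out for brevity):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Spin until this thread has seen the same CPU-ID for 1000 consecutive
    // calls, then pin the thread to that CPU.
    void stabilize_and_pin_current_thread()
    {
        int cpu = sched_getcpu();
        int stable = 0;
        while (stable < 1000) {
            const int now = sched_getcpu();
            if (now == cpu) {
                ++stable;
            } else {
                cpu = now;
                stable = 0;
            }
        }
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }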
In many cases, where you do not dynamically start and stop a lot of applications, hand-binding the threads to certain cores might be the easiest solution, though. And if you do start and stop a lot of apps dynamically, the "pick N free cores" algorithm described above will fail miserably if there aren't enough free cores left, anyway.

Analyzing spikes in performance measurement

I have a set of C++ functions which do some image-processing-related operations. Generally I see that the final output is delivered in the 5-6 ms time range. I am measuring the time taken using the QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see that the performance spikes up to 20 ms for some images. My question is how do I go about analyzing such issues. Basically I want to determine whether the spikes are caused by some delay in this code or whether some other task started running on the CPU, because of which this operation took longer. I have tried using the GetThreadTimes API to see how much time my thread spent on the CPU but am unable to conclude anything based on those numbers. What is the standard way to go about troubleshooting these types of issues?
The reason behind sudden spikes during processing could be any of IO, interrupts, scheduled processes, etc.
It is very common to see such spikes with such low-latency/short-processing-time operations. IMO you can attribute them to any of the above-mentioned reasons (there could be more). The simplest mitigation is to run the same experiment with more inputs multiple times and take the average for the final result.
To answer your question about checking/confirming the source of the spikes, you can try the following:
Check variation in images - already ruled out as per your comment
Monitor resource utilization during processing. Check if any resource is choking (% utilization is the simplest way to check, and the SAR/NMON utilities on Linux are best, with minimal overhead)
Reserve a few CPUs on the system (CPU affinity) for your experiment, dedicated only to your program, so that no OS task will run on them. taskset is the simplest utility to try out. More details are here.
Run the experiment with this setting and check behavior.
That's a nasty thing you are trying to figure out; I wouldn't even attempt to, since coming to concrete conclusions is hard.
In general, one should run a loop of many iterations (100 just seems too small, I think) and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt performance of your program.
A typical way to check whether "some other task started running inside the CPU" would be to run your program once and mark the images that produce the spike. For example, images 2, 4, 5, and 67 take too long to be processed. Run your program again a few times, and mark again which images produce the spikes.
If the same images produce these spikes, then it's not something caused by another exterior task.
What is the standard way to go about troubleshooting these types of issues?
There are real-time operating systems (RTOS) which guarantee those kinds of delays. They are a totally different class of operating system from Windows or Linux.
But still, there is something you can do about your delays even on a general-purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to a disk, there are no guarantees whatsoever about delays. So avoid any system functions on your critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform IO and pass data via a shared buffer to your critical code (see the sketch at the end of this section).
If your code base is big, use tools like strace on Linux or Dr Memory on Windows to trace system calls.
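A minimal single-producer/single-consumer ring buffer along those lines (a sketch only; it assumes exactly one IO thread pushing and one critical thread popping, and never blocks or makes a system call):

    #include <array>
    #include <atomic>
    #include <cstddef>

    // Fixed-size SPSC ring buffer: the IO thread push()es, the latency-critical
    // thread pop()s. push() returns false when full, pop() returns false when empty.
    template <typename T, std::size_t N>
    class SpscRing {
    public:
        bool push(const T& item) {                       // called by the IO thread only
            const auto head = head_.load(std::memory_order_relaxed);
            const auto next = (head + 1) % N;
            if (next == tail_.load(std::memory_order_acquire))
                return false;                            // full; the caller decides what to drop
            buf_[head] = item;
            head_.store(next, std::memory_order_release);
            return true;
        }
        bool pop(T& out) {                               // called by the critical thread only
            const auto tail = tail_.load(std::memory_order_relaxed);
            if (tail == head_.load(std::memory_order_acquire))
                return false;                            // empty
            out = buf_[tail];
            tail_.store((tail + 1) % N, std::memory_order_release);
            return true;
        }
    private:
        std::array<T, N> buf_{};
        std::atomic<std::size_t> head_{0};
        std::atomic<std::size_t> tail_{0};
    };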
2. Avoid context switches
Multithreading on Windows is preemptive. That means there is a system scheduler which might stop your thread at any time and schedule another thread on your CPU. As before, there are RTOSes which allow you to avoid such context switches, but there is something you can do about it:
make sure there is at least one CPU core left for system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints system scheduler to avoid scheduling other threads on this CPU;
make sure hardware interrupts go to another CPU; usually interrupts go to CPU 0, so the easiest way would be to bind your thread with CPU 1+;
increase your thread priority, so the scheduler is less likely to switch your thread out for another one.
There are tools like perf (Linux) and Intel VTune (Windows) to confirm whether context switches are occurring.
3. Avoid other non-deterministic features
A few more sources of unexpected delays:
disable swap, so you know for sure your thread memory will not be swapped out to a slow and unpredictable disk drive (a per-process alternative is sketched below);
disable CPU turbo boost -- after a high-performance CPU boosts, there is always a slowdown so that the CPU stays within its thermal design power (TDP);
disable hyper-threading -- from the scheduler's point of view these are independent CPUs, but in practice the performance of each hyper-threaded CPU depends on what the other thread is doing at the moment.
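On the swap point: if you cannot disable swap system-wide, locking the process's own memory achieves much the same effect for your threads. A minimal sketch, assuming the process has a sufficient RLIMIT_MEMLOCK (or CAP_IPC_LOCK):

    #include <sys/mman.h>

    // Lock all current and future pages of this process into RAM so the
    // latency-critical threads never take a major page fault to disk.
    bool lock_process_memory()
    {
        return mlockall(MCL_CURRENT | MCL_FUTURE) == 0;
    }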
Hope this helps.

Where's the balance between thread amount and thread block times?

Elongated question:
When having more blocking threads than CPU cores, where's the balance between the number of threads and thread block times to maximize CPU efficiency by reducing context-switch overhead?
I have a wide variety of IO devices that I need to control on Windows 7, with a x64 multi-core processor: PCI devices, network devices, stuff being saved to hard drives, big chunks of data being copied,... The most common policy is: "Put a thread on it!". Several dozen threads later, this is starting to feel like a bad idea.
None of my cores are being used 100%, and there are several cores which are still idling, but there are delays showing up in the range of 10 to 100 ms which cannot be explained by IO blocking or CPU-intensive usage. Other processes don't seem to require resources either. I suspect context-switch overhead.
There are a bunch of possible solutions I can think of:
Reduce threads by bundling the same IO devices: This mainly goes for the hard drive, but maybe for the network as well. If I'm saving 20MB to the hard drive in one thread, and 10MB in another, wouldn't it be better to post it all to the same thread? How would this work in the case of multiple hard drives?
Reduce threads by bundling similar IO devices, and increase their priority: Dozens of threads with increased priority are probably going to make my user-interface thread stutter. But I can bundle all that functionality together in one or a couple of threads and increase their priority.
Any case studies tackling similar problems are much appreciated.
First, it sounds like these tasks should be performed using asynchronous I/O (IO Completion Ports, preferably), rather than with separate threads. Blocking threads are generally the wrong way to do I/O.
Second, blocked threads shouldn't affect context switching. The scheduler has to juggle all the active threads, and so, having a lot of threads running (not blocked) might slow down context switching a bit. But as long as most of your threads are blocked, they shouldn't affect the ones that aren't.
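For reference, the shape of an IOCP-based design is roughly this (a sketch only; real code would associate device handles opened with FILE_FLAG_OVERLAPPED via CreateIoCompletionPort and use a small pool of workers):

    #include <windows.h>

    // One worker draining a completion port; in practice you'd start a few of these.
    DWORD WINAPI Worker(LPVOID param) {
        HANDLE port = static_cast<HANDLE>(param);
        for (;;) {
            DWORD bytes = 0;
            ULONG_PTR key = 0;
            OVERLAPPED* ov = nullptr;
            BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
            if (ov == nullptr) break;        // shutdown signal (posted below) or port error
            if (!ok) { /* the I/O identified by 'ov' failed; handle/log it */ continue; }
            // ... process the completed I/O identified by 'key' and 'ov' ...
        }
        return 0;
    }

    int main() {
        HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);
        HANDLE worker = CreateThread(nullptr, 0, Worker, port, 0, nullptr);
        // Device handles opened with FILE_FLAG_OVERLAPPED would be attached here
        // with CreateIoCompletionPort(device, port, key, 0) and driven by ReadFile/WriteFile.
        PostQueuedCompletionStatus(port, 0, 0, nullptr);   // tell the worker to exit
        WaitForSingleObject(worker, INFINITE);
        CloseHandle(worker);
        CloseHandle(port);
        return 0;
    }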
10-100ms with some cores idle: it's not context-switching overhead in itself since a switch is orders of magnitude faster than these delays, even with a core swap and cache flush.
Async I/O would not help much here. The kernel thread pools that implement ASIO also have to be scheduled/swapped, albeit this is faster than user-space threads since there are fewer Wagnerian ring-cycles. I would certainly head for ASIO if the CPU loading was becoming an issue, but it's not.
You are not short of CPU, so what is it? Is there much thrashing - RAM shortage? Excessive paging can surely result in large delays. Where is your page file? I've shoved mine off Drive C onto another fast SATA drive.
PCI bandwidth? You got a couple of TV cards in there?
Disk controller flushing activity - have you got an SSD that's approaching capacity? That's always a good one for unexplained pauses. I get the odd pause even though my 128G SSD is only 2/3 full.
I've never had a problem specifically related to context-switch time, and I've been writing multithreaded apps for decades. The Windows OS schedules and dispatches the ready threads onto cores reasonably quickly. 'Several dozen threads' in itself (i.e. not all running!) is not remotely a problem - looking now at my Task Manager/Performance, I have 1213 threads loaded and no performance issues at all with ~6% CPU usage (app under test running in the background, BitTorrent etc). Firefox has 30 threads, VLC media player 27, my test app 23. No problem at all writing this post.
Given your issue of 10-100ms delays, I would be amazed if fiddling with thread priorities and/or changing the way your work is loaded onto threads provides any improvement - something else is stuffing your system, (you haven't got any drivers that I coded, have you? :).
Does perfmon give any clues?
Rgds,
Martin
I don't think that there is a conclusive answer, and it probably depends on your OS as well; some handle threads better than others. Still, delays in the 10 to 100 ms range are not due to context switching itself (although they could be due to characteristics of the scheduling algorithm). My experience under Windows is that I/O is very inefficient, and if you're doing I/O, of any type, you will block. And that I/O by one process or thread will end up blocking other processes or threads. (Under Windows, for example, there's probably no point in having more than one thread handle the hard drive. You can't read or write several sectors at the same time, and my impression is that Windows doesn't optimize accesses like some other systems do.)
With regards to your exact questions:
"If I'm saving 20MB to the hard drive in one thread, and 10MB in the other, wouldn't it be better to post it all to the same?": It depends on the OS. Normally, there should be no reduction in time or latency using separate threads, and depending on other activity and the OS, there could be an improvement. (If there are several disk requests pending, most OS's will optimize the accesses, reordering the requests to reduce head movement.) The simplest solution would be to try both, and see which works better on your system.
"How would this work in case of multiple hard drives?": The OS should be able to do the I/O in parallel, if the requests are to different drives.
With regards to increasing the priority of one or more threads, it's very OS dependent, but probably worth trying. Unless there's significant CPU time used in the threads with the higher priority, it shouldn't impact the user interface; these threads are mostly blocked for I/O, remember.
Well, my Windows 7 is currently running 950 threads. I don't think that adding another few dozen would make a significant difference. However, you should definitely be looking at a thread pool or other work-stealing device for this - you shouldn't make new threads just to let them block. If Windows provides asynchronous I/O by default, then use it.

C++: Limiting CPU usage intentionally

At my company, we often test the performance of our USB and FireWire devices under CPU strain.
There is a test code we run that loads the CPU, and it is often used in really simple informal tests to see what happens to our device's performance.
I took a look at the code for this, and it's a simple loop that increments a counter and does a calculation based on the new value, storing the result in another variable.
Running a single instance will use 1/X of the CPU, where X is the number of cores.
So, for instance, if we're on a 8-core PC and we want to see how our device runs under 50% CPU usage, we can open four instances of this at once, and so forth...
I'm wondering:
What decides how much of the CPU gets used up? Does it just run everything as fast as it can on a single thread in a single-threaded application?
Is there a way to voluntarily limit the maximum CPU usage your program can use? I can think of some "sloppy" ways (add sleep commands or something), but is there a way to limit it to, say, some specified percentage of the available CPU?
CPU quotas on Windows 7 and on Linux.
Also on QNX (i.e. Blackberry Tablet OS) and LynuxWorks
In case of broken links, the articles are named:
Windows -- "CPU rate limits in Windows Server 2008 R2 and Windows 7"
Linux -- "CPU Usage Limiter for Linux"
QNX -- "Adaptive Partitioning"
LynuxWorks - "Partitioning Operating Systems" and "ARINC 653"
The OS usually decides how to schedule processes and on which CPUs they should run. It basically keeps a ready queue for processes which are ready to run (not marked for termination and not blocked waiting for some I/O, event etc.). Whenever a process uses up its timeslice or blocks, it basically frees a processing core and the OS selects another process to run. Now if you have a process which is always ready to run and never blocks, then this process essentially runs whenever it can, thus pushing the utilization of a processing unit to 100%. Of course this is a bit of a simplified description (there are things like process priorities, for example).
There is usually no generic way to achieve this. The OS you are using might offer some mechanism to do it (some kind of CPU quota). You could try to measure how much wall-clock time has passed vs. how much CPU time your process used up, and then put your process to sleep for certain periods to achieve an approximation of the desired CPU utilization.
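A crude sketch of that sleep-based approach (a duty-cycle loop; the 100 ms period is an arbitrary choice, and accuracy is limited by scheduler granularity):

    #include <chrono>
    #include <thread>

    // Burn roughly `target` of one core (e.g. 0.5 for ~50%) by alternating
    // busy work and sleep within a fixed period.
    void burn_fraction_of_core(double target)
    {
        using namespace std::chrono;
        const auto period = milliseconds(100);
        const auto busy = duration_cast<steady_clock::duration>(period * target);
        volatile unsigned long long sink = 0;   // keeps the busy loop from being optimized away
        for (;;) {
            const auto busy_until = steady_clock::now() + busy;
            while (steady_clock::now() < busy_until)
                sink += 1;
            std::this_thread::sleep_for(period * (1.0 - target));
        }
    }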
You've essentially answered your own questions!
The key trait of code that burns a lot of CPU is that it never does anything that blocks (e.g. waiting for network or file I/O), and never voluntarily yields its time slice (e.g. sleep(), etc.).
The other trick is that the code must do something that the compiler cannot optimize away. So, most likely your CPU-burn code outputs something based on the loop calculation at the end, or is simply compiled without optimization so that the optimizer isn't tempted to remove the useless loop. Since you're trying to load the CPU, there's no sense in optimizing anyway.
As you hypothesized, single-threaded code that matches this description will saturate a CPU core unless the OS has more of these processes than it has cores to run them--then it will round-robin schedule them and the utilization of each will be some fraction of 100%.
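For illustration, a load loop of that shape, with a volatile sink so the optimizer cannot remove it (a sketch matching the description above, not your company's actual test code):

    #include <cstdint>

    int main()
    {
        volatile uint64_t result = 0;   // volatile: the compiler must keep every store
        for (uint64_t counter = 0; ; ++counter) {
            result = counter * 2654435761ull + 12345;   // arbitrary calculation on the counter
        }
    }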
The issue isn't how much time the CPU spends idle, but rather how long it takes for your code to start executing. Who cares if it's idle or doing low-priority busywork, as long as the latency is low?
Your problem is fundamentally a consequence of using a synthetic benchmark, presumably in an attempt to obtain reproducible results. But synthetic benchmarks tend to produce meaningless results, so reproducibility is moot.
Look at your bug database, find actual customer complaints, and use actual software and test hardware to reproduce a situation that actually made someone dissatisfied. Develop the performance test in parallel with hard, meaningful performance requirements.

Thread limit in Unix before affecting performance

I have some questions regarding threads:
What is the maximum number of threads allowed for a process before it decreases the performance of the application?
If there's a limit, how can this be changed?
Is there an ideal number of threads that should be running in a multi-threaded application? If it depends on what the application is doing, can you cite an example?
What are the factors to consider that affects these performance/thread limit?
This is actually a hard set of questions to which there are no absolute answers, but the following should serve as decent approximations:
It is a function of your application behavior and your runtime environment, and can only be deduced by experimentation. There is usually a threshold after which your performance actually degrades as you increase the number of threads.
Usually, after you find your limits, you have to figure out how to redesign your application such that the cost-per-thread is not as high. (Note that for some domains, you can get better performance by redesigning your algorithm and reducing the number of threads.)
There is no general "ideal" number of threads, but you can sometimes find the optimal number of threads for an application on a specific runtime environment. This is usually done by experimentation, and graphing the results of benchmarks while varying the following (a minimal harness for the first of these sweeps is sketched after this list):
Number of threads.
Buffer sizes (if the data is not in RAM) incrementing at some reasonable value (e.g., block size, packet size, cache size, etc.)
Varying chunk sizes (if you can process the data incrementally).
Various tuning knobs for the OS or language runtime.
Pinning threads to CPUs to improve locality.
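A bare-bones harness for the first of those sweeps, varying only the thread count (a sketch; work() is a placeholder for the real per-thread workload):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Placeholder workload; replace with the work actually being measured.
    static void work()
    {
        volatile unsigned long long sink = 0;
        for (int i = 0; i < 50'000'000; ++i) sink += i;
    }

    int main()
    {
        const unsigned hw = std::thread::hardware_concurrency();
        const unsigned max_threads = (hw ? hw : 4) * 4;   // sweep up to 4x the core count
        for (unsigned n = 1; n <= max_threads; ++n) {
            const auto start = std::chrono::steady_clock::now();
            std::vector<std::thread> pool;
            for (unsigned i = 0; i < n; ++i) pool.emplace_back(work);
            for (auto& t : pool) t.join();
            const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start).count();
            std::printf("%u threads: %lld ms\n", n, static_cast<long long>(ms));
        }
    }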
There are many factors that affect thread limits, but the most common ones are:
Per-thread memory usage (the more memory each thread uses, the fewer threads you can spawn)
Context-switching cost (the more threads you use, the more CPU-time is spent switching).
Lock contention (if you rely on a lot of coarse-grained locking, increasing the number of threads simply increases the contention.)
The threading model of the OS (How does it manage the threads? What are the per-thread costs?)
The threading model of the language runtime. (Coroutines, green-threads, OS threads, sparks, etc.)
The hardware. (How many CPUs/cores? Is it hyperthreaded? Does the OS load-balance the threads appropriately, etc.)
Etc. (there are many more, but the above are the most important ones.)
The answer to your questions 1, 3, and 4 is "it's application dependent". Depending on what your threads do, you may need a different number to maximize your application's efficiency.
As to question 2, there's almost certainly a limit, and it's not necessarily something you can change easily. The number of concurrent threads might be limited per user, or there might be a maximum number of allowed threads in the kernel.
There's nothing fixed: it depends what they are doing. Sometimes adding more threads to do asynchronous I/O can increase the performance of another thread with no bad side effects.
This is likely fixed at compile time.
No, it's a process architecture decision. But having at least one listener-scheduler thread besides the one or more threads doing the heavy lifting suggests the number should normally be at least two.
Almost certainly, your ability to really grasp what is going on. Threaded code chokes easily and in the most unexpected ways: making sure the code has no races/deadlocks is hard. Study different ways of handling concurrency, such as shared-nothing (cf. Erlang).
As long as you never have more threads using CPU time than you have cores, you will have optimal performance, but as soon as you have to wait for I/O there will be unused CPU cycles. So you may want to profile your application and see what portion of the time it spends maxing out the CPU and what portion it spends waiting for RAM, hard disk, network, and other IO; in general, if you are waiting for I/O you could add one more thread (provided that you are primarily CPU bound).
For the hard and absolute limit, check out PTHREAD_THREADS_MAX in limits.h; this may be what you are looking for. It might be POSIX_THREAD_MAX on some systems.
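You can also query it at run time via sysconf; note that many systems (Linux included) report no determinate limit here, so a -1 result is normal (a sketch):

    #include <cstdio>
    #include <unistd.h>

    int main()
    {
        // -1 means "no determinate limit", which is the common answer on Linux.
        const long max_threads = sysconf(_SC_THREAD_THREADS_MAX);
        std::printf("PTHREAD_THREADS_MAX: %ld\n", max_threads);
    }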
Any app with more busy threads than the number of processors will cause some overall slowdown. There's an upper limit, but it varies from system to system. For some, it used to be 256, and you could recompile the OS to get it a bit higher.
As long as the threads are designed to do separate tasks, there is not much of an issue. However, the problem starts when these threads share resources, at which point a locking mechanism has to be implemented.