Linux - Limit threads per process - C++

I've written a C++ program that benchmarks a couple of algorithms. Some of these algorithms use other libraries for their calculations. These external libraries (which I don't have control over) use multi-threading, which makes it difficult to get a proper benchmark (some algorithms are single-threaded, some are multi-threaded).
So when benchmarking, I want to limit the threads to 1. Is there any way I can start a program on Linux and tell it to use a maximum of 1 thread, instead of the default in the external libraries (which is equal to the number of cores)?

I'm not sure what you're asking is possible from an OS perspective (what could the OS do? Return failure when the next thread is requested?).
I'd search the libraries' documentation; they will probably provide configuration values for the number of threads they use.
Alternatively, you can set processor affinity using the taskset command.
It is not exactly what you requested, but it ensures that your program runs on a given CPU (or set of CPUs). So regardless of the number of threads spawned, the program will use at most one CPU (or however many you specify) at any time, so the effective parallelism is controlled.
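For example (a hypothetical invocation; assume the benchmark binary is called ./benchmark), pinning the process and every thread it spawns to CPU 0 looks like this:
taskset -c 0 ./benchmark
The external libraries may still spawn several threads, but they will all timeshare that single core.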
A second alternative, triggered by the comment from #goocreations... One way of fooling the programs into believing they have a different number of CPUs available is to run them in a virtual machine (e.g. VirtualBox).

Related

Finding out how many threads make sense to create

I have a calculation task on a large amount of data, so it can quite easily be parallelized. The next question is how many threads it makes sense to create. Of course I can measure the time for different numbers of threads on my machine, but what if the program will be run on different machines, so that I can't really make manual measurements? Is just getting the number of threads from std::thread::hardware_concurrency() good enough, or are there other ways?
That function (std::thread::hardware_concurrency()) will give you the total logical core count, including hyper-threading.
If your program does intensive number crunching, I would say using only physical cores and setting processor affinity is the best choice.
You can discover the current processor topology with the hwloc library, which works on most platforms.
You may find a comprehensible explanation (though a bit old) here.
If there is a lot of I/O, then you may run two threads per processor to allow one to process data while the other is waiting for input, or one extra thread without affinity so it can take processor time while the others are waiting for I/O; but this is a very rough estimate: better to measure on your machine.
If you can test on other processors, you may adopt a different strategy for each processor.
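As a minimal Linux sketch of the suggestion above (assumptions: native_handle() maps to a pthread_t, which holds for libstdc++ on Linux; logical CPU numbering starts at 0; halve the count yourself if you only want physical cores):
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void work(unsigned id)
{
    // ... per-thread computation ...
}

int main()
{
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 1;   // the function may return 0 when the count is unknown

    std::vector<std::thread> threads;
    for (unsigned i = 0; i < n; ++i)
    {
        threads.emplace_back(work, i);
        // Pin thread i to CPU i (pthread_setaffinity_np is a GNU extension).
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(i, &set);
        pthread_setaffinity_np(threads.back().native_handle(), sizeof(set), &set);
    }
    for (std::thread &t : threads) t.join();
}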

Analyzing spikes in performance measurement

I have a set of C++ functions that do some image-processing-related operations. Generally I see that the final output is delivered in the 5-6 ms range. I am measuring the time taken using the QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see that the time spikes up to 20 ms for some images. My question is how do I go about analyzing such issues? Basically I want to determine whether the spikes are caused by some delay in this code or whether some other task started running on the CPU, because of which this operation took longer. I have tried using the GetThreadTimes API to see how much time my thread spent on the CPU, but am unable to conclude anything based on those numbers. What is the standard way to go about troubleshooting these types of issues?
The reason behind sudden spikes during processing could be any of I/O, interrupts, scheduled processes, etc.
It is very common to see such spikes with such low-latency/short-processing-time operations. IMO you can attribute them to any of the above-mentioned reasons (there could be more). The simplest solution is to run the same experiment with more inputs multiple times and take the average.
To answer your question about checking/confirming the source of the spikes, you can try the following:
Check variation in the images - already ruled out as per your comment.
Monitor resource utilization during processing. Check if any resource is choking (% utilization is the simplest way to check, and the SAR/NMON utilities on Linux are best, with minimal overhead).
Reserve a few CPUs on the system (CPU affinity) for your experiment, dedicated only to your program, so that no OS task will run on them. taskset is the simplest utility to try out. More details are here.
Run the experiment with this setting and check the behavior.
That's a nasty thing you are trying to figure out; I wouldn't even attempt to, since coming to concrete conclusions is hard.
In general, one should run a loop of many iterations (100 just seems too small, I think), and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt the performance of your program.
A typical way to check whether "some other task started running on the CPU" would be to run your program once and mark the images that produce the spikes. For example, images 2, 4, 5, and 67 take too long to be processed. Run your program again a few times, and again mark which images produce the spikes.
If the same images produce the spikes, then it's not something caused by another, exterior task.
What is the standard way to go about troubleshooting these types of issues?
There are Real-Time Operating Systems (RTOSes) which guarantee those kinds of delays. They are a totally different class of operating systems than Windows or Linux.
But still, there are things you can do about your delays even on a general-purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to a disk, there are no guarantees whatsoever about delays. So, avoid any system functions on your critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform I/O and pass data via a shared buffer to your critical code (a minimal sketch follows below).
If your code base is big, use tools like strace on Linux or Dr Memory on Windows to trace system calls.
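For the second point, a minimal sketch of handing I/O off to a separate thread via a shared buffer might look like this (a mutex-protected queue is one simple choice; stricter real-time code often uses a lock-free ring buffer instead, since even waiting on a condition variable can sleep in the kernel):
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::queue<std::string> shared_buffer;   // filled by the I/O thread, drained by the critical one
std::mutex m;
std::condition_variable cv;
bool done = false;

void io_thread()                         // issues all the blocking system calls
{
    std::string line;
    while (std::getline(std::cin, line))
    {
        { std::lock_guard<std::mutex> lock(m); shared_buffer.push(line); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_one();
}

void critical_thread()                   // performs no I/O of its own
{
    for (;;)
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !shared_buffer.empty() || done; });
        if (shared_buffer.empty()) break;
        std::string item = std::move(shared_buffer.front());
        shared_buffer.pop();
        lock.unlock();
        // ... time-critical processing of item ...
    }
}

int main()
{
    std::thread io(io_thread), crit(critical_thread);
    io.join();
    crit.join();
}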
2. Avoid context switches
Multithreading on Windows is preemptive. That means there is a system scheduler which might stop your thread at any time and schedule another thread on your CPU. As before, there are RTOSes which allow you to avoid such context switches, but there is something you can do about it:
make sure there is at least one CPU core left for the system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints the system scheduler to avoid scheduling other threads on that CPU (a minimal Linux sketch follows after this list);
make sure hardware interrupts go to another CPU; usually interrupts go to CPU 0, so the easiest way would be to bind your thread to CPU 1 or higher;
increase your thread's priority, so the scheduler is less likely to switch your thread out for another one.
You can use tools like perf (Linux) and Intel VTune (Windows) to confirm whether context switches are occurring.
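As a minimal Linux sketch of the affinity and interrupt points above (pinning the calling thread to CPU 1, away from CPU 0 where interrupts usually land):
#include <sched.h>
#include <cstdio>

int main()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);                                  // CPU 1, leaving CPU 0 to interrupts and the OS
    if (sched_setaffinity(0, sizeof(set), &set) != 0)  // pid 0 = the calling thread
        std::perror("sched_setaffinity");
    // ... time-critical work now stays on CPU 1 ...
}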
3. Avoid other non-deterministic features
A few more sources of unexpected delays:
disable swap, so you know for sure your thread's memory will not be swapped out to a slow and unpredictable disk drive (a per-process alternative is sketched after this list);
disable CPU turbo boost -- after a high-performance CPU boosts, there is always a slowdown, so that the CPU stays within its thermal design power (TDP);
disable hyper-threading -- from the scheduler's point of view these are independent CPUs, but in fact the performance of each hyper-threaded CPU depends on what the other thread is doing at the moment.
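For the swap point specifically, besides disabling swap system-wide, a process can lock its own memory; a minimal Linux sketch using mlockall():
#include <sys/mman.h>
#include <cstdio>

int main()
{
    // Lock all current and future pages of this process into RAM so they
    // can never be swapped out (may need a raised RLIMIT_MEMLOCK or root).
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        std::perror("mlockall");
    // ... time-critical work ...
}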
Hope this helps.

Am I disturbing other programs with OpenMP?

I'm using OpenMP for a loop like this:
#pragma omp parallel for
for (int out = 1; out <= matrix.rows; out++)
{
...
}
I'm doing a lot of computations on a machine with 64 CPUs. This works quite well, but my question is:
Am I disturbing other users on this machine? Usually they only run single-threaded programs. Will those still run at 100%? Obviously I will disturb other multithreaded programs, but will I disturb single-threaded programs?
If yes, can I prevent this? I think I can set the maximum number of threads with omp_set_num_threads. I can set this to 60, but I don't think this is the best solution.
The ideal solution would disturb no other single-threaded programs but take as many CPUs as possible.
Every multitasking OS has something called a process scheduler. This is an OS component that decides where and when to run each process. Schedulers are usually quite stubborn in the decisions they make, but those could often be influenced by various user-supplied policies and hints. The default configuration for almost any scheduler is to try and spread the load over all available CPUs, which often results in processes migrating from one CPU to another.
Fortunately, any modern OS except "the most advanced desktop OS" (a.k.a. OS X) supports something called processor affinity. Every process has a set of processors on which it is allowed to execute - the so-called CPU affinity set of that process. By configuring disjoint affinity sets for various processes, those could be made to execute concurrently without stealing CPU time from each other. Explicit CPU affinity is supported on Linux, FreeBSD (with the ULE scheduler), Windows NT (this also includes all desktop versions since Windows XP), and possibly other OSes (but not OS X).
Every OS then provides a set of kernel calls to manipulate the affinity and also an instrument to do that without writing a special program. On Linux this is done using the sched_setaffinity(2) system call and the taskset command-line instrument. Affinity could also be controlled by creating a cpuset instance. On Windows one uses SetProcessAffinityMask() and/or SetThreadAffinityMask(), and affinities can be set in Task Manager from the context menu for a given process. Also, one could specify the desired affinity mask as a parameter to the START shell command when starting new processes.
What this all has to do with OpenMP is that most OpenMP runtimes for the listed OSes support, in one form or another, ways to specify the desired CPU affinity for each OpenMP thread. The simplest control is the OMP_PROC_BIND environment variable. This is a simple switch - when set to TRUE, it instructs the OpenMP runtime to "bind" each thread, i.e. to give it an affinity set that includes a single CPU only. The actual placement of threads on CPUs is implementation-dependent, and each implementation provides its own way to control it. For example, the GNU OpenMP runtime (libgomp) reads the GOMP_CPU_AFFINITY environment variable, while the Intel OpenMP runtime (open-sourced not long ago) reads the KMP_AFFINITY environment variable.
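For example, with the GNU runtime you might launch the program like this (a hypothetical invocation; the binary name and the CPU list are placeholders, and GOMP_CPU_AFFINITY is libgomp-specific as noted above):
OMP_NUM_THREADS=60 OMP_PROC_BIND=TRUE GOMP_CPU_AFFINITY="0-59" ./your_program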
The rationale here is that you could limit your program's affinity so that it only uses a subset of all the available CPUs. The remaining processes would then predominantly get scheduled to the rest of the CPUs, though this is only guaranteed if you manually set their affinity (which is only doable if you have root/Administrator access, since otherwise you can only modify the affinity of processes that you own).
It is worth mentioning that it often (but not always) makes no sense to run with more threads than the number of CPUs in the affinity set. For example, if you limit your program to run on 60 CPUs, then using 64 threads would result in some CPUs being oversubscribed and in timesharing between the threads. This will make some threads run slower than the others. The default scheduling for most OpenMP runtimes is schedule(static), and therefore the total execution time of the parallel region is determined by the execution time of the slowest thread. If one thread timeshares with another one, then both threads will execute slower than the threads that do not timeshare, and the whole parallel region will get delayed. Not only does this reduce the parallel performance, but it also results in wasted cycles, since the faster threads would simply wait doing nothing (possibly busy-looping at the implicit barrier at the end of the parallel region). The solution is to use dynamic scheduling, i.e.:
#pragma omp parallel for schedule(dynamic,chunk_size)
for (int out = 1; out <= matrix.rows; out++)
{
...
}
where chunk_size is the size of the iteration chunk that each thread gets. The whole iteration space is divided into chunks of chunk_size iterations, which are given to the worker threads on a first-come-first-served basis. The chunk size is an important parameter. If it is too low (the default is 1), then there could be a huge overhead from the OpenMP runtime managing the dynamic scheduling. If it is too high, then there might not be enough work available for each thread. It makes no sense to have a chunk size bigger than matrix.rows / #threads.
Dynamic scheduling allows your program to adapt to the available CPU resources, even if they are not uniform, e.g. if there are other processes running and timesharing with the current one. But it comes with a catch: big systems like your 64-core one are usually ccNUMA (cache-coherent non-uniform memory access) systems, which means that each CPU has its own memory block and access to the memory block(s) of other CPU(s) is costly (e.g. takes more time and/or provides less bandwidth). Dynamic scheduling tends to destroy data locality, since one cannot be sure that a block of memory which resides on one NUMA node won't get used by a thread running on another NUMA node. This is especially important when data sets are large and do not fit in the CPU caches. Therefore, YMMV.
Put your process on low priority within the operating system. Use as many resources as you like. If someone else needs those resources, the OS will make sure to provide them, because those processes are at a higher (i.e. normal) priority. If there are no other users, you will get all the resources.
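On Linux, for instance, this can be done from the shell with nice (the program name is a placeholder):
nice -n 19 ./your_program
which starts the process at the lowest scheduling priority; on Windows, the START /LOW shell command achieves something similar.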

How many threads can a C++ application create

I'd like to know how many threads a C++ application can create at most.
Do the OS, hardware caps, and other factors influence these bounds?
[C++11: 1.10/1]: [..] Under a hosted implementation, a C++ program can have more than one thread running concurrently. [..] Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.
[C++11: 30.3/1]: 30.3 describes components that can be used to create and manage threads. [ Note: These threads are intended to map one-to-one with operating system threads. —end note ]
So, basically, it's totally up to the implementation & OS; C++ doesn't care!
It doesn't even list a recommendation in Annex B "Implementation quantities"! (which seems like an omission, actually).
C++ as a language does not specify a maximum (or even a minimum beyond the one). The particular implementation can, but I have never seen it done directly. The OS also can, but normally it just states something like "limited by system resources". Each thread uses up some non-paged memory, selector-table entries, and other bounded things, so you may run out of those. Even if you don't, the system will become pretty unresponsive if the threads actually do work.
Looking at it from the other side, real parallelism is limited by the actual cores in the system, and you shouldn't have too many threads. Applications that could logically spawn hundreds or thousands of threads usually start using thread pools for good practical reasons.
Basically, there are no limits at your C++ application level. The maximum number of threads is determined more at the OS level (based on your architecture and the memory available).
On Linux, there is no limit on the maximum number of threads per process. The number of threads is limited system-wide. You can check the maximum number of allowed threads with:
cat /proc/sys/kernel/threads-max
On Windows you can use the testlimit tool to check the maximum number of threads:
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx
On Mac OS, please read this table to find the number of threads based on your hardware configuration.
However, please keep in mind that you are on a multitasking system. The number of threads executing at the same time is limited by the total number of processor cores available. To do more things, the system switches between all these threads. Each "switch" has a performance cost. If your system is "switching" too much, it won't spend much time actually working, and your overall system will be slow.
Generally, the limit on the number of threads is the amount of memory available, but there have been systems around with lower limits.
Unless you go mad creating threads, it's very unlikely the limit will be a problem. Creating more threads is rarely beneficial once you reach a certain number - that number may be around the same as, or a few times higher than, the number of cores (which for really big, heavy hardware can be a few hundred these days, with 16-core processors and 8 sockets).
Threads that are CPU-bound should not number more than the processors - nothing good comes from that.
Threads that are doing I/O or otherwise "sitting around waiting" can be higher in number - 2-5 per processor core seems reasonable. Given that modern machines have 8 sockets and 16 cores at the higher end of the spectrum, that's still only around 1000 threads.
Sure, it's possible to design, say, a web server system where each connection is a thread, with 10k or 20k connections active at any given time. But it's probably not the most efficient way to do it.
I'd like to know how many threads a C++ application can create at most.
Implementation/OS-dependent.
Keep in mind that there were no standard threads in C++ prior to C++11.
Do the OS, hardware caps, and other factors influence these bounds?
Yes.
The OS might be able to limit the number of threads a process can create.
The OS can limit the total number of threads running simultaneously (to prevent fork bombs, etc.; Linux can definitely do that).
Available physical (and virtual) memory will limit the number of threads you can create if each thread allocates its own stack.
There can be a (possibly hardcoded) limit on how many thread "handles" the OS can provide.
The underlying OS/platform might not have threads at all (a real-mode compiler for DOS/FreeDOS or something similar).
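As an illustration of the first two points, on Linux the per-user cap can be queried programmatically; a minimal sketch (RLIMIT_NPROC counts processes and threads per user, and the values may be RLIM_INFINITY):
#include <sys/resource.h>
#include <cstdio>

int main()
{
    rlimit rl;
    // RLIMIT_NPROC caps the number of processes/threads a user may create.
    if (getrlimit(RLIMIT_NPROC, &rl) == 0)
        std::printf("soft limit: %llu, hard limit: %llu\n",
                    (unsigned long long)rl.rlim_cur,
                    (unsigned long long)rl.rlim_max);
}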
Apart from the general impracticality of having many more threads than cores, yes, there are limits. For example, a system may keep a unique "process ID" for each thread, and there may be only 65535 of them available. Also, each thread will have its own stack, and those stacks will eventually consume too much memory (you can, however, adjust the size of each stack when you spawn threads; a sketch follows below).
Here's an informative article--ignore the fact that it mentions Windows, as the concepts are similar on other common systems: http://blogs.msdn.com/b/oldnewthing/archive/2005/07/29/444912.aspx
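For instance, here is a minimal Linux/pthreads sketch of shrinking the per-thread stack (the 256 KiB figure is an arbitrary example; the platform default is often several MiB, and PTHREAD_STACK_MIN is the lower bound):
#include <pthread.h>
#include <cstdio>

void* worker(void*)
{
    // ... thread work, careful not to overflow the smaller stack ...
    return nullptr;
}

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);   // 256 KiB instead of the default
    pthread_t tid;
    if (pthread_create(&tid, &attr, worker, nullptr) != 0)
        std::fprintf(stderr, "pthread_create failed\n");
    else
        pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
}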
There is nothing in the C++ standard that limits the number of threads. However, the OS will certainly have a hard limit.
Having too many threads decreases the throughput of your application, so it's recommended that you use a thread pool.
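A minimal sketch of such a pool (assumptions: a fixed worker count, tasks modelled as std::function<void()>, and a first-come-first-served queue):
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
    explicit ThreadPool(unsigned n)
    {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { run(); });
    }
    ~ThreadPool()
    {
        { std::lock_guard<std::mutex> lock(m); stop = true; }
        cv.notify_all();
        for (std::thread &w : workers) w.join();
    }
    void submit(std::function<void()> task)
    {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
private:
    void run()
    {
        for (;;)
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [this] { return stop || !tasks.empty(); });
            if (stop && tasks.empty()) return;
            std::function<void()> task = std::move(tasks.front());
            tasks.pop();
            lock.unlock();
            task();   // run the task outside the lock
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stop = false;
};
Usage would then look like ThreadPool pool(std::thread::hardware_concurrency()); followed by pool.submit(...) calls, so the thread count stays fixed no matter how many tasks arrive.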

Increasing C++ Program CPU Use

I have a program written in C++ that runs a number of for loops per second without using anything that would make it wait for any reason. It consistently uses 2-10% of the CPU. Is there any way to force it to use more of the CPU and do a greater number of calculations without making the program more complex? Additionally, I compile with C::B on a Windows computer. Essentially, I'm asking whether there is a way to make my program faster by increasing its CPU usage, and if so, how.
That depends on why it's only using 10% of the CPU. If it's because you're using a multi-CPU machine and your program is using only one CPU, then no, you will have to introduce concurrency into your code to use that additional horsepower.
If it's being limited by something else (e.g. copying data to and from the disk), then you don't need to focus on CPU, you need to focus on whatever the bottleneck is. Most likely, the limiter will be reading from the disk, which you can improve by using better caching mechanisms.
Assuming your application has the power (the PROCESS_SET_INFORMATION access right), you can use SetPriorityClass to bump up your priority (to the usual detriment of all other processes, of course).
You can go ABOVE_NORMAL_PRIORITY_CLASS (try this one first), HIGH_PRIORITY_CLASS (be very careful with this one) or REALTIME_PRIORITY_CLASS (I would strongly suggest that you probably shouldn't give this one a shot).
If you try the higher priorities and it's still clocking pretty low, then that's probably because you're not CPU-bound (such as if you're writing data to an output file). If that's the case, you'll probably have to find a way to make yourself CPU-bound.
Just keep in mind that doing so may not be necessary (or even desirable). If you're running at a higher priority than other threads and you're still not sucking up a lot of CPU, it's probably because Windows has (most likely, rightfully) decided you don't need it.
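As a minimal sketch of that call (raising one's own process to ABOVE_NORMAL_PRIORITY_CLASS, per the suggestion above):
#include <windows.h>
#include <cstdio>

int main()
{
    // Raise this process's priority class; needs the PROCESS_SET_INFORMATION
    // right, which GetCurrentProcess() grants for one's own process.
    if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
        std::printf("SetPriorityClass failed: %lu\n", GetLastError());
    // ... CPU-bound work ...
}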
It's really not the program's right or responsibility to demand additional resources from the system. That's the OS' job, as resource scheduler.
If it is necessary to use more CPU time than the OS sees fit, you should request that from the OS using the platform-dependent API. In this case, that seems to be something along the lines of SetPriorityClass or SetThreadPriority.
Creating a thread and giving it higher priority might be one way.
If you use C++, consider using Intel Threading Building Blocks. You can find some examples here.
Some profilers give very nice indications of where the bottlenecks in your code are. For example, CodeAnalyst (for AMD chips only) has the instructions-per-cycle ratio. I'm sure Intel's profilers are similar.
As Billy O'Neal says though, if you're running on an 8-core machine, being stuck at 10 percent of CPU is about right. If this is your problem, then Windows MSVC++ has a parallel mode (the Parallel Patterns Library) for the standard algorithms. This can give you parallelization for free if you have written your loops the C++ way (it's still your responsibility to make sure your loops are thread-safe). I've not used the MSVC version, but the gnu::__parallel_for_each etc. work a treat.