Thread limit in Unix before affecting performance - c++

I have some questions regarding threads:
What is the maximum number of threads allowed for a process before it decreases the performance of the application?
If there's a limit, how can this be changed?
Is there an ideal number of threads that should be running in a multi-threaded application? If it depends on what the application is doing, can you cite an example?
What are the factors to consider that affect this performance/thread limit?

This is actually a hard set of questions to which there are no absolute answers, but the following should serve as decent approximations:
It is a function of your application behavior and your runtime environment, and can only be deduced by experimentation. There is usually a threshold after which your performance actually degrades as you increase the number of threads.
Usually, after you find your limits, you have to figure out how to redesign your application such that the cost-per-thread is not as high. (Note that for some domains, you can get better performance by redesigning your algorithm and reducing the number of threads.)
There is no general "ideal" number of threads, but you can sometimes find the optimal number of threads for an application on a specific runtime environment. This is usually done by experimentation, and graphing the results of benchmarks while varying the following (a minimal timing harness is sketched after this list):
Number of threads.
Buffer sizes (if the data is not in RAM) incrementing at some reasonable value (e.g., block size, packet size, cache size, etc.)
Varying chunk sizes (if you can process the data incrementally).
Various tuning knobs for the OS or language runtime.
Pinning threads to CPUs to improve locality.
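For example, a harness along these lines lets you graph wall time against thread count (a minimal sketch; do_work is a hypothetical stand-in for the real workload, and you would vary the other knobs the same way):

#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real workload being tuned.
void do_work(std::size_t begin, std::size_t end)
{
    volatile double sink = 0.0;
    for (std::size_t i = begin; i < end; ++i)
        sink = sink + static_cast<double>(i) * 0.5;
}

int main()
{
    const std::size_t total_items = 10000000;
    const unsigned hw = std::thread::hardware_concurrency();  // may be 0 ("unknown")
    const unsigned max_threads = 2 * (hw ? hw : 2);
    for (unsigned n = 1; n <= max_threads; ++n)
    {
        const auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        const std::size_t chunk = total_items / n;
        for (unsigned t = 0; t < n; ++t)
            workers.emplace_back(do_work, t * chunk,
                                 t == n - 1 ? total_items : (t + 1) * chunk);
        for (auto& w : workers)
            w.join();
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        std::cout << n << " threads: " << elapsed.count() << " s\n";
    }
}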
There are many factors that affect thread limits, but the most common ones are:
Per-thread memory usage (the more memory each thread uses, the fewer threads you can spawn)
Context-switching cost (the more threads you use, the more CPU-time is spent switching).
Lock contention (if you rely on a lot of coarse-grained locking, increasing the number of threads simply increases the contention).
The threading model of the OS (How does it manage the threads? What are the per-thread costs?)
The threading model of the language runtime. (Coroutines, green-threads, OS threads, sparks, etc.)
The hardware. (How many CPUs/cores? Is it hyper-threaded? Does the OS load-balance the threads appropriately? Etc.)
Etc. (there are many more, but the above are the most important ones.)

The answer to your questions 1, 3, and 4 is "it's application dependent". Depending on what your threads do, you may need a different number to maximize your application's efficiency.
As to question 2, there's almost certainly a limit, and it's not necessarily something you can change easily. The number of concurrent threads might be limited per user, or there might be a maximum number of allowed threads in the kernel.

There's nothing fixed: it depends what they are doing. Sometimes adding more threads to do asynchronous I/O can increase the performance of another thread with no bad side effects.
Any such limit is likely fixed at compile time.
No, it's a process architecture decision. But having at least one listener-scheduler thread besides the one or more threads doing the heavy lifting suggests the number should normally be at least two.
Almost certainly: your ability to really grasp what is going on. Threaded code chokes easily and in the most unexpected ways: making sure the code has no races or deadlocks is hard. Study different ways of handling concurrency, such as shared-nothing message passing (cf. Erlang).

As long as you never have more threads using CPU time than you have cores, you will have optimal performance. But as soon as a thread has to wait for I/O, there will be unused CPU cycles. So you may want to profile your application and see what portion of its time is spent maxing out the CPU and what portion is spent waiting for RAM, hard disk, network, and other I/O. In general, if you are waiting for I/O you could add one more thread (provided that you are primarily CPU bound).
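One crude way to get that first CPU-bound versus wait-bound split is to compare process CPU time with wall-clock time (a sketch; run_workload is a hypothetical stand-in, and a real profiler will give far more detail):

#include <chrono>
#include <ctime>
#include <iostream>

// Hypothetical stand-in for the code path being profiled.
void run_workload()
{
    volatile double sink = 0.0;
    for (long i = 0; i < 100000000; ++i)
        sink = sink + i;
}

int main()
{
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();

    run_workload();

    const double cpu_s =
        static_cast<double>(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    const double wall_s = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - wall_start).count();

    // If CPU time is much smaller than wall time, the workload is mostly
    // waiting (I/O, sleeps, page faults) and extra threads may help hide
    // that latency; if they are roughly equal, you are CPU bound.
    std::cout << "CPU: " << cpu_s << " s, wall: " << wall_s << " s\n";
}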
For the hard and absolute limit, check out PTHREAD_THREADS_MAX in limits.h; this may be what you are looking for. It might be POSIX_THREAD_MAX on some systems.
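On POSIX systems you can also query this limit at runtime (a minimal sketch; whether a fixed limit is reported at all varies by platform):

#include <iostream>
#include <unistd.h>   // sysconf

int main()
{
    // POSIX exposes the PTHREAD_THREADS_MAX limit at runtime via sysconf;
    // -1 usually means "indeterminate", i.e. no fixed limit is defined.
    const long max_threads = sysconf(_SC_THREAD_THREADS_MAX);
    if (max_threads == -1)
        std::cout << "No fixed per-process thread limit reported.\n";
    else
        std::cout << "Per-process thread limit: " << max_threads << "\n";
}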

Any app with more busy threads than the number of processors will cause some overall slowdown. There's an upper limit, but it varies system to system. For some, it used to be 256 and you could recompile the OS to get it a bit higher.

As long as the threads are designed to do separate tasks, there is not much of an issue. The problems start when the threads contend for shared resources and a locking mechanism has to be introduced.

Related

What's the "real world" performance improvement for multithreading I can expect?

I'm programming a recursive tree search with multiple branches, and it works fine. To speed it up I'm implementing simple multithreading: I distribute the search into main branches and scatter them among the threads. Each thread doesn't have to interact with the others, and when a solve is found I add it to a common std::vector using a mutex this way:
if (CubeTest.IsSolved())
{   // Solve algorithm found
    std::lock_guard<std::mutex> guard(SearchMutex); // Thread-safe code
    Solves.push_back(Alg); // Add the solve
}
I don't allocate variables in dynamic store (heap) with new and delete, since the memory needs are small.
The maximum number of threads I use is the quantity I get from: std::thread::hardware_concurrency()
I did some tests, always the same search but changing the number of threads used, and I found things that I didn't expect.
I know that if you double the number of threads (if the processor has enough capacity) you can't expect to double the performance, because of context switching and things like that.
For example, I have an old Intel Xeon X5650 with 6 cores / 12 threads. If I execute my code, up to the sixth thread things are as expected, but if I use an additional thread the performance is worse. Using more threads increases the performance very little, to the point that using all available threads (12) barely improves on using only 6:
Threads vs processing time chart for Xeon X5650:
(I repeated each test several times and show the average times of all the runs.)
I repeated the tests on another computer with an Intel i7-4600U (2 cores / 4 threads) and found this:
Threads vs processing time chart for i7-4600U:
I understand that with fewer cores the performance gain from using more threads is worse.
I also think that when you start to use the second thread on the same core, the performance is penalized in some way. Am I right? How can I improve the performance in this situation?
So my question is whether these performance gains from multithreading are what I can expect in the real world, or whether these numbers are telling me that I'm doing things wrong and I should learn more about multithreaded programming.
What's the “real world” performance improvement for multithreading I can expect?
It depends on many factors. In general, the most optimistic improvement that one can hope for is a reduction of runtime by a factor of the number of cores¹. In most cases this is unachievable because of the need for threads to synchronise with one another.
In the worst case, not only is there no improvement due to lack of parallelism, but the overhead of synchronisation and cache contention can also make the runtime much worse than the single-threaded program.
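As a rough yardstick (Amdahl's law; the numbers here are illustrative assumptions, not measurements from this answer): if a fraction p of the work can run in parallel and the rest is serial or synchronisation, the best speedup on N cores is

speedup(N) = 1 / ((1 - p) + p / N)

With p = 0.9 and N = 12, that gives 1 / (0.1 + 0.075) ≈ 5.7, which is one reason adding threads beyond a certain point buys very little.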
Peak memory use often increases linearly with the number of threads, because each thread needs to operate on data of its own.
Total CPU time usage, and therefore energy use also increases due to extra time spent on synchronisation. This is relevant to systems that operate on battery power as well as those that have poor heat management (both apply to phones and laptops).
Binary size would be marginally larger due to extra code that deals with threads.
¹ Whether you get all of the performance out of "logical" cores (i.e. "hyper-threading" or "clustered multi-threading") also depends on many factors. Often, one executes the same function in all threads, in which case the threads tend to use the same parts of the CPU, and sharing the core between multiple threads doesn't necessarily yield a benefit.
A CPU which uses hyperthreading claims to be able to execute two threads simultaneously on one core. But it doesn't truly double the core: the two hardware threads share the core's execution units and caches, and the core interleaves instructions from both threads rather than giving each one the full machine.
So what's the point of hyperthreading at all?
The thread switches inside the CPU are far cheaper than thread switches managed by the thread scheduler of the operating system, and one thread's instructions can fill pipeline stalls (such as cache misses) in the other. So the performance gains come mostly from avoiding switch overhead and hiding stalls; hyperthreading does not give the core twice the execution resources.
Conclusion: the performance gain you can expect from concurrency depends on the number of physical cores of the CPU, not logical cores.
Also keep in mind that thread synchronization methods like mutexes can become pretty expensive, so the less locking you can get away with, the better. When you have multiple threads filling the same result set, it can sometimes be better to let each thread build its own result set and then merge those sets once all threads are finished.
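A minimal sketch of that merge-at-the-end pattern, using hypothetical stand-ins (Solve, search_branch) for the asker's types:

#include <algorithm>
#include <thread>
#include <vector>

struct Solve { /* ... */ };  // hypothetical stand-in for the solve type

// Hypothetical stand-in for the per-branch search; it appends results to
// a thread-local set, so no locking is needed while searching.
void search_branch(int branch, std::vector<Solve>& local_solves)
{
    (void)branch;
    local_solves.push_back(Solve{});
}

std::vector<Solve> search_all(int branches)
{
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::vector<Solve>> partial(n);  // one result set per thread
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back([&partial, t, n, branches] {
            for (int b = static_cast<int>(t); b < branches;
                 b += static_cast<int>(n))
                search_branch(b, partial[t]);    // lock-free per-thread work
        });
    for (auto& w : workers)
        w.join();

    std::vector<Solve> all;  // single-threaded merge at the very end
    for (const auto& p : partial)
        all.insert(all.end(), p.begin(), p.end());
    return all;
}

int main() { search_all(1000); }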

Multithreading crowds out other processes

I have added multithreading to a raytracer I am writing, and while it does run much faster now, when it's running, my computer is almost unusably slow. Obviously I want to use all my PC's compute power, but I don't want it to prevent any other application from getting access to the CPUs.
I thought about having the threads sleep, but unless they all sleep at the same time, the other threads would just eat up the extra time. Also, I don't necessarily want to give up a certain percentage of available compute power if I'm not going to use it.
Also (this is not my official question), I've noticed that for some reason the first thread launched does more work than the second, and the second more than the third, and so on, until the last 5 or so threads (out of 32) won't actually get a crack at any work, despite the fact that there's plenty to go around (there are at least 0.5M work items for them to chew through). If someone would like to venture a guess in the comments, it would be appreciated.
If you use standard threads, you could use std::thread::hardware_concurrency to get an estimate of the maximum number of threads that are really supported by the hardware, in order not to overload your CPU.
If it returns 0, the information is not available. Otherwise you could limit yourself to this number, or a little below it (considering that other processes need CPU time as well).
If limiting the number of threads does not improve responsiveness, you can also consider calling std::this_thread::yield() from time to time to give the scheduler an opportunity to run other threads. But depending on the kind of job and the synchronisation you use, this second alternative might decrease performance.
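A minimal sketch of that idea (render_chunk is a hypothetical stand-in for the raytracer's work function):

#include <thread>
#include <vector>

// Hypothetical stand-in for the per-chunk rendering work.
void render_chunk(unsigned chunk_index, unsigned chunk_count)
{
    (void)chunk_index;
    (void)chunk_count;
}

int main()
{
    // hardware_concurrency() may return 0 ("unknown"); otherwise leave one
    // hardware thread free so the rest of the system stays responsive.
    const unsigned hw = std::thread::hardware_concurrency();
    const unsigned n = hw > 1 ? hw - 1 : 1;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back(render_chunk, t, n);
    for (auto& w : workers)
        w.join();
}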
As requested, my comment as an answer:
It sounds like you've oversubscribed your poor CPU. Try reducing the number of threads?
If there are significantly more threads than hardware cores, a lot of time is wasted switching between threads, scheduling them in the OS, and contending over shared variables. It also causes a general slowdown of the other running programs, because they have to contend with the large number of threads from your program (which by default all have the same priority as the other programs' threads in the eyes of the OS scheduler).

Will multithreading improve performance significantly if I have a fixed amount of calculations that are independent from each other?

I am programming a raycasting game engine.
Each ray can be calculated without knowing anything about the other rays (I'm only calculating distances).
Since there is no waiting time between calculations, I wonder whether it's worth the effort to make the ray calculations multithreaded or not.
Is it likely that there will be a performance boost?
Most likely, multithreading will improve performance if done correctly. The way you've stated your problem, it is a perfect candidate for multithreading, since the computations are independent, reducing the need for coordination between threads to a minimum.
Some reasons you still might not get a speed up, or may not get the full speed up you expect could include:
1) The bottleneck may not be on-die CPU execution resources (e.g., ALU-bound operations), but rather something shared like memory or shared LLC bandwidth.
For example, on some architectures, a single thread may be able to saturate memory bandwidth, so adding more cores may not help. A more common case is that a single core can saturate some fraction, 1/N < 1 of main memory bandwidth, and this value is larger than 1/C where C is the core count. For instance, on a 4 core box, one core may be able to consume 50% of the bandwidth. Then, for a memory-bound computation, you'll get good scaling to 2 cores (using 100% of bandwidth), but little to none above that.
Other resources which are shared among cores include disk and network IO, GPU, snoop bandwidth, etc. If you have a hyper-threaded platform, this list increases to include all levels of cache and ALU resources for logical cores sharing the same physical core.
2) Contention "in practice" between operations which are "theoretically" independent.
You mention that your operations are independent. Typically this means that they are logically independent - they don't share any data (other than perhaps immutable input) and they can write to separate output areas. That doesn't exclude the possibility, however, that any given implementation actually has some hidden sharing going on.
One classic example is false-sharing - where independent variables fall in the same cache line, so logically independent writes to different variables from different threads end up thrashing the cache line between cores.
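A minimal illustration of the fix: pad or align each thread's variable so it gets its own cache line (the 64-byte line size below is a common assumption, not a universal constant):

#include <thread>

// Bad: a and b are "independent" variables, but they almost certainly sit
// in the same cache line, so two threads writing them thrash that line.
struct SharedCounters { long a = 0; long b = 0; };

// Better: force each counter onto its own (assumed 64-byte) cache line.
struct alignas(64) PaddedCounter { long value = 0; };

PaddedCounter counters[2];  // sizeof rounds up to 64, so no line is shared

int main()
{
    std::thread t1([] { for (int i = 0; i < 10000000; ++i) ++counters[0].value; });
    std::thread t2([] { for (int i = 0; i < 10000000; ++i) ++counters[1].value; });
    t1.join();
    t2.join();
}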
Another example, frequently encountered in practice, is contention via a library - if your routines use malloc heavily, you may find that all the threads spend most of their time waiting on a lock inside the allocator, since malloc is a shared resource. This can be remedied by reducing reliance on malloc (perhaps via fewer, larger allocations) or with a good concurrent allocator such as Hoard or tcmalloc.
3) Implementation of the distribution and collection of the computation across threads may overwhelm the advantage you get from multiple threads. For example, if you spin up a new thread for every individual ray, the thread creation overhead would dominate your runtime and you would likely see a negative benefit. Even if you use a thread-pool of persistent threads, choosing a "work unit" that is too fine grained will impose a lot of coordination overhead which may eliminate your benefits.
Similarly, if you have to copy the input data to and from the worker threads, you may not see the scaling you expect. Where possible, use pass-by-reference for read-only data.
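A minimal sketch of choosing a coarse work unit: one thread per block of rays rather than a thread per ray (trace_ray is a hypothetical stand-in for the real computation):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-ray computation.
void trace_ray(std::size_t ray_index) { (void)ray_index; }

void trace_all(std::size_t ray_count)
{
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t per_thread = ray_count / n;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        const std::size_t begin = t * per_thread;
        const std::size_t end = t == n - 1 ? ray_count : begin + per_thread;
        // One thread per *block* of rays, not per ray: the creation and
        // scheduling overhead is paid n times, not ray_count times.
        workers.emplace_back([begin, end] {
            for (std::size_t r = begin; r < end; ++r)
                trace_ray(r);
        });
    }
    for (auto& w : workers)
        w.join();
}

int main() { trace_all(1000000); }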
4) You don't have more than 1 core, or you do have more than 1 core but they are already occupied running other threads or processes. In these cases, the effort to coordinate multiple threads is pure overhead.
In general, it depends. Given that the calculations are independent, it sounds like this is a good candidate for potential performance improvements due to threading. Ray calculations typically can benefit from this.
However, there are many other factors, such as memory access requirements, as well as the underlying system on which this runs, which will have a tremendous impact on this. It's often possible to have multithreaded versions run slower than single threaded versions if not written correctly, so profiling is the only way to answer this definitively.
Probably yes: multithreading (e.g. with pthreads) could improve performance, but you surely want to benchmark (and you might be disappointed if your program is memory bound, not CPU bound). You could also consider OpenCL (to run some regular numeric computations on the GPGPU) and OpenMP (to explicitly ask the compiler, through pragmas, to parallelize some of your code).
Maybe Open MPI might be considered, to run the work as several communicating processes. And if you are brave (or crazy) you could mix several approaches.
In reality, it depends upon the algorithm and the system (both hardware and operating system), and you should benchmark (e.g. some micro-prototype related to your needs).
If on some particular system the bottleneck is the memory bandwidth (not the CPU), multi-threading or multi-processing won't help much (and probably could degrade performance).
Also, the cost of synchronization may vary widely (e.g. locking a mutex can be very fast on some systems and 50x slower on others).
Very likely. Independent calculations are a perfect candidate for parallelization. In the case of raycasting, there are so many of them that they would spread nicely across as many parallel threads as the hardware permits.
An unexpected bottleneck for calculations that would otherwise have perfect data-independence can be concurrent writes to nearby locations (false sharing of cache lines).

How many threads can a C++ application create

I'd like to know how many threads a C++ application can create at most.
Do the OS, hardware caps, and other factors influence these bounds?
[C++11: 1.10/1]: [..] Under a hosted implementation, a C++ program can have more than one thread running concurrently. [..] Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.
[C++11: 30.3/1]: 30.3 describes components that can be used to create and manage threads. [ Note: These threads are intended to map one-to-one with operating system threads. —end note ]
So, basically, it's totally up to the implementation & OS; C++ doesn't care!
It doesn't even list a recommendation in Annex B "Implementation quantities"! (which seems like an omission, actually).
C++ as a language does not specify a maximum (or even a minimum beyond one). The particular implementation can, but I have never seen it done directly. The OS also can, but normally just states something like "limited by system resources". Each thread uses up some non-paged memory, selector-table entries, and other bounded things, so you may run out of those; if you don't, the system will become pretty unresponsive if the threads actually do work.
Looking at it from the other side, real parallelism is limited by the actual cores in the system, and you shouldn't have too many threads. Applications that could logically spawn hundreds or thousands usually switch to thread pools for good practical reasons.
Basically, there are no limits at your C++ application level. The maximum number of threads is determined more at the OS level (based on your architecture and the memory available).
On Linux, there is no limit on the maximum number of threads per process; the number of threads is limited system-wide. You can check the maximum number of allowed threads with:
cat /proc/sys/kernel/threads-max
On Windows you can use the Testlimit tool to check the maximum number of threads:
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx
On Mac OS, please read this table to find the number of threads based on your hardware configuration.
However, please keep in mind that you are on a multitasking system. The number of threads executing at the same time is limited by the total number of processor cores available. To do more things, the system switches between all these threads. Each "switch" has a performance cost. If your system is "switching" too much, it won't spend much time actually "working" and your overall system will be slow.
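If you want to observe the ceiling from C++ itself, std::thread reports creation failure as std::system_error; a sketch (it deliberately exhausts a system limit, so expect the machine to become sluggish while it runs):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <system_error>
#include <thread>
#include <vector>

int main()
{
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    std::vector<std::thread> threads;
    try {
        for (;;) {
            // Each thread just parks until told to exit.
            threads.emplace_back([&] {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return done; });
            });
        }
    } catch (const std::system_error& e) {
        std::cout << "Thread creation failed after " << threads.size()
                  << " threads: " << e.what() << "\n";
    }

    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_all();
    for (auto& t : threads)
        t.join();
}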
Generally, the limit of number of threads is the amount of memory available, but there have been systems around that have lower limits.
Unless you go mad with creating threads, it's very unlikely that the limit will be a problem. Creating more threads is rarely beneficial once you reach a certain number; that number may be around the same as, or a few times higher than, the number of cores (which for really big, heavy hardware can be a few hundred these days, with 16-core processors and 8 sockets).
Threads that are CPU bound should not be more than the number of processors - nothing good comes from that.
Threads that are doing I/O or otherwise "sitting around waiting" can be higher in number - 2-5 per processor core seems reasonable. Given that modern machines have 8 sockets and 16 cores at the higher end of the spectrum, that's still only around 1000 threads.
Sure, it's possible to design, say, a webserver system where each connection is a thread, and the system has 10k or 20k connections active at any given time. But it's probably not the most efficient.
I'd like to know how many threads a C++ application can create at most.
Implementation/OS-dependent.
Keep in mind that there were no standard threads in C++ prior to C++11.
Do the OS, hardware caps, and other factors influence these bounds?
Yes.
The OS might be able to limit the number of threads a process can create.
The OS can limit the total number of threads running simultaneously (to prevent fork bombs, etc.; Linux can definitely do that).
Available physical (and virtual) memory will limit the number of threads you can create, since each thread needs its own stack.
There can be a (possibly hardcoded) limit on how many thread "handles" OS can provide.
Underlying OS/platform might not have threads at all (real-mode compiler for DOS/FreeDOS or something similar).
Apart from the general impracticality of having many more threads than cores, yes, there are limits. For example, a system may keep a unique "process ID" for each thread, and there may be only 65535 of them available. Also, each thread will have its own stack, and those stacks will eventually consume too much memory (you can, however, adjust the size of each stack when you spawn threads).
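std::thread offers no portable way to set the stack size, but POSIX threads do; a minimal sketch (the 64 KiB figure is illustrative, not a recommendation):

#include <iostream>
#include <pthread.h>

// Worker that keeps its stack usage small (no deep recursion, no big arrays).
void* worker(void*) { return nullptr; }

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // 64 KiB is an illustrative value; the default is often several MiB.
    pthread_attr_setstacksize(&attr, 64 * 1024);

    pthread_t tid;
    if (pthread_create(&tid, &attr, worker, nullptr) == 0)
        pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
}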
Here's an informative article--ignore the fact that it mentions Windows, as the concepts are similar on other common systems: http://blogs.msdn.com/b/oldnewthing/archive/2005/07/29/444912.aspx
There is nothing in the C++ standard that limits number of threads. However, OS will certainly have a hard limit.
Having too many threads decreases the throughput of your application, so it's recommended that you use a thread pool.

The meaning of multithreading on a single-core CPU

I once thought the only occasion where multiple threads should be used is when I/O processing is needed.
But I heard it's also useful without I/O processing, because it helps to occupy more CPU resources.
In my understanding, this would be
the process with more threads is given more CPU time.
Is this why multiple threads help improve performance even on single core?
One possible reason you can see greater performance from multiple threads on a single CPU is that CPUs tend to be really good at instruction reordering and making use of instruction-level parallelism. Threads have fewer data and control dependencies with respect to one another than any two sequential instructions within a single thread, and therefore they offer more possibilities for the CPU and OS-level schedulers and re-ordering mechanisms to be very clever.
Don't forget that things like "reads and writes in memory" are still "I/O" when viewed in a particular way. These are relatively slow operations, and much of the pipelining in modern CPUs is used to hide memory latency - having multiple threads executing at once can be useful for filling up time that would otherwise have to be filled with delay slots where there are data hazards within a single thread.
That said, threads are often not a good solution to increase performance, and can have precisely the opposite effect. It can be very easy to saturate all available memory bandwidth using a single thread on some problems.