How many threads can a C++ application create - c++

I'd like to know, how many threads can a C++ application create at most.
Does OS, hardware caps and other factors influence on these bounds?

[C++11: 1.10/1]: [..] Under a hosted implementation, a C++ program can have more than one thread running concurrently. [..] Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.
[C++11: 30.3/1]: 30.3 describes components that can be used to create and manage threads. [ Note: These threads are intended to map one-to-one with operating system threads. —end note ]
So, basically, it's totally up to the implementation & OS; C++ doesn't care!
It doesn't even list a recommendation in Annex B "Implementation quantities"! (which seems like an omission, actually).

C++ as language does not specify a maximum (or even a minimum beyond the one). The particular implementation can, but I never saw it done directly. The OS also can, but normally just states a lank like limited by system resources. Each thread uses up some nonpaged memory, selector tables, other bound things, so you may run out of that. If you don't the system will become pretty unresponsive if the threads actually do work.
Looking from other side, real parallelism is limited by actual cores in the system, and you shall not have too many threads. Applications that could logically spawn hundreds or thousands usually start using thread pools for good practical reasons.

Basically, there are no limits at your C++ application level. The number of maximum thread is more on the OS level (based on your architecture and memory available).
On Linux, there are no limit on the maximum number of thread per process. The number of thread is limited system wide. You can check the number of maximum allowed threads by doing:
cat /proc/sys/kernel/threads-max
On Windows you can use the testlimit tool to check the maximum number of thread:
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx
On Mac OS, please read this table to find the number of thread based on your hardware configuration
However, please keep in mind that you are on a multitasking system. The number of threads executed at the same time is limited by the total number of processor cores available. To do more things, the system tries to switch between all theses thread. Each "switch" has a performce (a few milliseconds). If your system is "switching" too much, it won't speed too much time to "work" and your overall system will be slow.

Generally, the limit of number of threads is the amount of memory available, but there have been systems around that have lower limits.
Unless you go mad with creating threads, it's very unlikely it will be a problem to have a limit. Creating more threads is rarely beneficial, once you reach a certain number - that number may be around the same as, or a few times higher than, the number of cores (which for real big, heavy hardware can be a few hundred these days, with 16-core processors and 8 sockets).
Threads that are CPU bound should not be more than the number of processors - nothing good comes from that.
Threads that are doing I/O or otherwise "sitting around waiting" can be higher in numbers - 2-5 per processor core seems reasonable. Given that modern machines have 8 sockets and 16 cores at the higher end of the spectrum, that's still only around 1000 threads.
Sure, it's possible to design, say, a webserver system where each connection is a thread, and the system has 10k or 20k connections active at any given time. But it's probably not the most efficient.

I'd like to know, how many threads can a C++ application create at most.
Implementation/OS-dependent.
Keep in mind that there were no threads in C++ prior to C++11.
Does OS, hardware caps and other factors influence on these bounds?
Yes.
OS might be able limit number of threads a process can create.
OS can limit total number of threads running simultaneously (to prevent fork bombs, etc, linux can definitely do that).
Available physical(and virtual) memory will limit number of threads you can create IF each thread allocates its own stack.
There can be a (possibly hardcoded) limit on how many thread "handles" OS can provide.
Underlying OS/platform might not have threads at all (real-mode compiler for DOS/FreeDOS or something similar).

Apart from the general impracticality of having many more threads than cores, yes, there are limits. For example, a system may keep a unique "process ID" for each thread, and there may be only 65535 of them available. Also, each thread will have its own stack, and those stacks will eventually consume too much memory (you can however adjust the size of each stack when you spawn threads).
Here's an informative article--ignore the fact that it mentions Windows, as the concepts are similar on other common systems: http://blogs.msdn.com/b/oldnewthing/archive/2005/07/29/444912.aspx

There is nothing in the C++ standard that limits number of threads. However, OS will certainly have a hard limit.
Having too many threads decreases the throughput of your application, so it's recommended that you use a thread pool.

Related

How to multithread core-schedule onto different cores (ideally in C++)

I have a large C++11 multithreaded application where the threads are always active, communicating to each other constantly, and should be scheduled on different physical CPUs for reasonable performance.
The default Linux behavior AFAIK is that threads will typically/often get scheduled onto the same CPU, causing horrible performance.
To solve this, I understand how to attach threads to specific physical CPUs in C++, e.g.:
std::cout << "Assign to thread cpu " << cpu << "\n";
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
int rc = pthread_setaffinity_np(thread.native_handle(), sizeof(cpu_set_t), &cpuset);
and can use this to pin to specific CPUs, e.g. attach 4 threads to CPUs 0,2,4,6.
However this approach requires a specific CPU number which is a problem in that there may be many programs running on a host using other CPUs. These might be my program or other programs. As just one example an 8 core machine might have two copies of my 4-threaded application so obviously having both of those two programs pick the same 4 CPUs is a problem.
I'd thus like a way to say "schedule the threads in this set all on different CPUs without caring of the CPU number". Is this possible in C++(11)?
If not, is this possible with numactl or another utility? E.g. I don't want "numactl -C 0,2,4,6" but rather "numactl -C W,X,Y,Z" where the scheduler can pick arbitrary W,X,Y,Z subject to W!=X!=Y!=Z.
I'm most interested in Linux behavior. I cannot change the OS configuration. I don't want the separate applications to cross communicate (nor can they as they might be other applications I do not control.)
Once I have the answer to this, the follow up is how do I modify this to add a e.g. fifth thread I do want to schedule on the same CPU as the first thread?
My problem in a specific Boost ASIO multithreaded application is, that even with a limited number of threads (like ten) on a system with much more cores, the threads get pushed around onto different cores all the time, which seriously reduces performances due to a high number of L1/L2 cache misses.
I have not searched much, yet, but there is a getcpu() system call on Linux, that returns the CPU-ID and NUMA Node-ID of the active thread, that is calling getcpu(). To get a set of unique CPU-IDs, one could try to create all threads, first, then let them all wait for a barrier via pthread_barrier_wait() and after that call getcpu() repeatedly in each thread until the returned values have stabilized. Stability has been reached, when each thread has gotten the same CPU-ID as answer for at least the last 1000 calls to getcpu() AND all the answers to all the different threads are different. It is of extreme importance to use non-blocking techniques like std::atomic values to synchronize during this testing phase. Because, if you wait for some Mutexes instead, the likelyhood is high, that your threads get re-mixed again by the scheduler.
After stability has been reached, each thread just sets its CPU affinity to its current CPU-ID and you are done.
In many cases, where you do not dynamically start and stop a lot of applications, hand-binding the threads to certain Cores might be the easiest solution, though. And if you do start and stop a lot of apps dynamically, the "pick N free cores" algo described above will fail miserably, if there aren't enough free cores left, anyways.

Maximum available threads in Windows system? C++

What is the code in C++ to get maximum number of avalible threads in system?
C++ does not have the concept of a maximum number of threads.
It does have the concept of a thread failing to be created, by raising std::system_error. This can happen for any number of reasons, including your OS deciding it doesn't want to spawn any more threads - either because you've hit a hard or soft limit on thread count, or because it actually cannot create a thread if it wanted (e.g. your address space is consumed).
The actual limit would need to be queried in an OS-specific way, outside the C++ standard. For example, on Linux one could query /proc/sys/kernel/threads-max and any relevant ulimit and compute a possible limit.
On Windows there is no queryable limit, and you are limited by address space. See for example "Does Windows have a limit of 2000 threads per process?" exploring this limitation.
The reason systems don't make this trivial to query is because it should not matter. You will quickly exhaust your usable cores long before you hit any practical limit in thread count. Don't make so many threads!
std::thread::hardware_concurrency()
Returns the number of hardware thread contexts. If this value is not computable or well-defined, an implementation should return 0.
You can however create many more std::thread objects, but only this many threads will execute in parallel at any time.
For OpenMP (OMP) you also have omp_get_max_threads()
Returns an integer that is equal to or greater than the number of threads that would be available if a parallel region without num_threads were defined at that point in the code.

Concurrent Programming act on each element in array

I have a question whch related to parallel programming. If I have a program that acts on each and every elemnt of an array why might it not be advantagous to use all the available processors?
I was thinking maybe because of the significant overhead of setting up and managing multiple threads or if the array size didnt warrant a concurrent solution. Can anyone think of anything else?
Some processors may already be busy doing important things, or you may want to leave spare capacity just in case they need to respond quickly to new workloads. For example, in a desktop system with 8 processors, you may want to leave 1 free to keep the UI responsive, while you fork out 7 "batch-processing" threads on the others. In a non-UI system, you may still want to keep one or more cores listening to OS interrupts or doing network IO.
A particularly frustrating example would be starting a parallel computation on all your cores, finding that you should have tweaked a parameter before launching it, and not being able to interrupt the computation because there is no spare computing power left to allow the UI to respond to your 'cancel' button.
I would have made that array a static variable and according to its size, I would have divided the task and assigned multiple treads to carry the work for each set of elements in the array.
For example, If I have 100 elements in the array. I would have divided it and made sets of 10.
and with 10 different threads I would have carried out my work.
Correct me if I am not getting you.
EDITED:-
The OS already does precisely that for you. It doesn't guarantee that each thread will stay on the same core forever (and in nearly all cases, there's no need for that either), but it does try to keep as many cores busy as possible. Which means giving all available threads their own core as much as possible.
Note:- Direct correlation between program threads and OS threads is not guaranteed, at least according to this for .net : http://msdn.microsoft.com/en-us/library/74169f59.aspx
Hope this make some sense.

What is the maximum number of threads a process can have in windows

In a windows process is there any limit for the threads to be used at a time. If so what is the maximum number of threads that can be used per process?
There is no limit that I know of, but there are two practical limits:
The virtual space for the stacks. For example in 32-bits the virtual space of the process is 4GB, but only about 2G are available for general use. By default each thread will reserve 1MB of stack space, so the top value are 2000 threads. Naturally you can change the size of the stack and make it lower so more threads will fit in (parameter dwStackSize in CreateThread or option /STACK in the linker command). If you use a 64-bits system this limit practically dissapears.
The scheduler overhead. Once you read the thousands of threads, just scheduling them will eat nearly 100% of your CPU time, so they are mostly useless anyway. This is not a hard limit, just your program will be slower and slower the more threads you create.
The actual limit is determined by the amount of available memory in various ways. There is no limit of "you can't have more than this many" of threads or processes in Windows, but there are limits to how much memory you can use within the system, and when that runs out, you can't create more threads.
See this blog by Mark Russinovich:
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx

Thread limit in Unix before affecting performance

I have some questions regarding threads:
What is the maximum number of threads allowed for a process before it decreases the performance of the application?
If there's a limit, how can this be changed?
Is there an ideal number of threads that should be running in a multi-threaded application? If it depends on what the application is doing, can you cite an example?
What are the factors to consider that affects these performance/thread limit?
This is actually a hard set of questions to which there are no absolute answers, but the following should serve as decent approximations:
It is a function of your application behavior and your runtime environment, and can only be deduced by experimentation. There is usually a threshold after which your performance actually degrades as you increase the number of threads.
Usually, after you find your limits, you have to figure out how to redesign your application such that the cost-per-thread is not as high. (Note that for some domains, you can get better performance by redesigning your algorithm and reducing the number of threads.)
There is no general "ideal" number of threads, but you can sometimes find the optimal number of threads for an application on a specific runtime environment. This is usually done by experimentation, and graphing the results of benchmarks while varying the following:
Number of threads.
Buffer sizes (if the data is not in RAM) incrementing at some reasonable value (e.g., block size, packet size, cache size, etc.)
Varying chunk sizes (if you can process the data incrementally).
Various tuning knobs for the OS or language runtime.
Pinning threads to CPUs to improve locality.
There are many factors that affect thread limits, but the most common ones are:
Per-thread memory usage (the more memory each thread uses, the fewer threads you can spawn)
Context-switching cost (the more threads you use, the more CPU-time is spent switching).
Lock contention (if you rely on a lot of coarse grained locking, the increasing the number of threads simply increases the contention.)
The threading model of the OS (How does it manage the threads? What are the per-thread costs?)
The threading model of the language runtime. (Coroutines, green-threads, OS threads, sparks, etc.)
The hardware. (How many CPUs/cores? Is it hyperthreaded? Does the OS loadbalance the threads appropriately, etc.)
Etc. (there are many more, but the above are the most important ones.)
The answer to your questions 1, 3, and 4 is "it's application dependent". Depending on what your threads do, you may need a different number to maximize your application's efficiency.
As to question 2, there's almost certainly a limit, and it's not necessarily something you can change easily. The number of concurrent threads might be limited per-user, or there might be a maximum number of a allowed threads in the kernel.
There's nothing fixed: it depends what they are doing. Sometimes adding more threads to do asynchronous I/O can increase the performance of another thread with no bad side effects.
This is likely fixed at compile time.
No, it's a process architecture decision. But having at least one listener-scheduler thread besides the one or more threads doing the heavy lifting suggests the number should normally be at least two.
Almost certainly, your ability to really grasp what is going on. Threaded code chokes easily and in the most unexpected ways: making sure the code has no races/deadlocks is hard. Study different ways of handling concurrency, such as shared-nothing (cf. Erlang).
As long as you never have more threads using CPU time than you have cores, you will have optimal performance, but then as soon as you have to wait for I/O There will be unused CPU cycles, so you may want to profile you applications, and see wait portion of the time it spends maxing out the CPU and what portion waiting for RAM, Hard Disk, Network, and other IO, in general if you are waiting for I/O you could have 1 more thread (Provided that you are primarily CPU bound).
For the hard and absolute limit Check out PTHREAD_THREADS_MAX in limits.h this may be what you are looking for. Might be POSIX_THREAD_MAX on some systems.
Any app with more busy threads than the number of processors will cause some overall slowdown. There's an upper limit, but it varies system to system. For some, it used to be 256 and you could recompile the OS to get it a bit higher.
As long as the threads are designed to do separate tasks, then there is not so much issue. However, the problem starts when these threads intersect the resources when locking mechanism should be implemented.