maximum number of the threads created by a process - c++

This might look weird but I'm trying to analyze an old C++ code in our company.
It's a highly multi-threaded code that insanely creates and destroys threads.
For some reasons, I'm interested in the maximum number of the threads that the process creates during its life time.
Let's assume that the process's behavior is deterministic.
How can I count the maximum number of the threads created by a process?
My environment: Windows/VS2010

Related

Maximum available threads in Windows system? C++

What is the code in C++ to get maximum number of avalible threads in system?
C++ does not have the concept of a maximum number of threads.
It does have the concept of a thread failing to be created, by raising std::system_error. This can happen for any number of reasons, including your OS deciding it doesn't want to spawn any more threads - either because you've hit a hard or soft limit on thread count, or because it actually cannot create a thread if it wanted (e.g. your address space is consumed).
The actual limit would need to be queried in an OS-specific way, outside the C++ standard. For example, on Linux one could query /proc/sys/kernel/threads-max and any relevant ulimit and compute a possible limit.
On Windows there is no queryable limit, and you are limited by address space. See for example "Does Windows have a limit of 2000 threads per process?" exploring this limitation.
The reason systems don't make this trivial to query is because it should not matter. You will quickly exhaust your usable cores long before you hit any practical limit in thread count. Don't make so many threads!
std::thread::hardware_concurrency()
Returns the number of hardware thread contexts. If this value is not computable or well-defined, an implementation should return 0.
You can however create many more std::thread objects, but only this many threads will execute in parallel at any time.
For OpenMP (OMP) you also have omp_get_max_threads()
Returns an integer that is equal to or greater than the number of threads that would be available if a parallel region without num_threads were defined at that point in the code.

How many number of threads are OK to use in my c++ application

So my professor once told me in college don't use more threads in your program then your PC physically has.
I'm confused, if you look at my system resource allocation, I'm currently using 4811 threads i.e.,
and I only have 16 thread processor.
So my questions is, can my c++ program significantly exceed the number of threads my computer physically has?
Also, what happens when you exceed the number of threads your computer has, the OS queues them correct ?

How many threads can a C++ application create

I'd like to know, how many threads can a C++ application create at most.
Does OS, hardware caps and other factors influence on these bounds?
[C++11: 1.10/1]: [..] Under a hosted implementation, a C++ program can have more than one thread running concurrently. [..] Under a freestanding implementation, it is implementation-defined whether a program can have more than one thread of execution.
[C++11: 30.3/1]: 30.3 describes components that can be used to create and manage threads. [ Note: These threads are intended to map one-to-one with operating system threads. —end note ]
So, basically, it's totally up to the implementation & OS; C++ doesn't care!
It doesn't even list a recommendation in Annex B "Implementation quantities"! (which seems like an omission, actually).
C++ as language does not specify a maximum (or even a minimum beyond the one). The particular implementation can, but I never saw it done directly. The OS also can, but normally just states a lank like limited by system resources. Each thread uses up some nonpaged memory, selector tables, other bound things, so you may run out of that. If you don't the system will become pretty unresponsive if the threads actually do work.
Looking from other side, real parallelism is limited by actual cores in the system, and you shall not have too many threads. Applications that could logically spawn hundreds or thousands usually start using thread pools for good practical reasons.
Basically, there are no limits at your C++ application level. The number of maximum thread is more on the OS level (based on your architecture and memory available).
On Linux, there are no limit on the maximum number of thread per process. The number of thread is limited system wide. You can check the number of maximum allowed threads by doing:
cat /proc/sys/kernel/threads-max
On Windows you can use the testlimit tool to check the maximum number of thread:
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx
On Mac OS, please read this table to find the number of thread based on your hardware configuration
However, please keep in mind that you are on a multitasking system. The number of threads executed at the same time is limited by the total number of processor cores available. To do more things, the system tries to switch between all theses thread. Each "switch" has a performce (a few milliseconds). If your system is "switching" too much, it won't speed too much time to "work" and your overall system will be slow.
Generally, the limit of number of threads is the amount of memory available, but there have been systems around that have lower limits.
Unless you go mad with creating threads, it's very unlikely it will be a problem to have a limit. Creating more threads is rarely beneficial, once you reach a certain number - that number may be around the same as, or a few times higher than, the number of cores (which for real big, heavy hardware can be a few hundred these days, with 16-core processors and 8 sockets).
Threads that are CPU bound should not be more than the number of processors - nothing good comes from that.
Threads that are doing I/O or otherwise "sitting around waiting" can be higher in numbers - 2-5 per processor core seems reasonable. Given that modern machines have 8 sockets and 16 cores at the higher end of the spectrum, that's still only around 1000 threads.
Sure, it's possible to design, say, a webserver system where each connection is a thread, and the system has 10k or 20k connections active at any given time. But it's probably not the most efficient.
I'd like to know, how many threads can a C++ application create at most.
Implementation/OS-dependent.
Keep in mind that there were no threads in C++ prior to C++11.
Does OS, hardware caps and other factors influence on these bounds?
Yes.
OS might be able limit number of threads a process can create.
OS can limit total number of threads running simultaneously (to prevent fork bombs, etc, linux can definitely do that).
Available physical(and virtual) memory will limit number of threads you can create IF each thread allocates its own stack.
There can be a (possibly hardcoded) limit on how many thread "handles" OS can provide.
Underlying OS/platform might not have threads at all (real-mode compiler for DOS/FreeDOS or something similar).
Apart from the general impracticality of having many more threads than cores, yes, there are limits. For example, a system may keep a unique "process ID" for each thread, and there may be only 65535 of them available. Also, each thread will have its own stack, and those stacks will eventually consume too much memory (you can however adjust the size of each stack when you spawn threads).
Here's an informative article--ignore the fact that it mentions Windows, as the concepts are similar on other common systems: http://blogs.msdn.com/b/oldnewthing/archive/2005/07/29/444912.aspx
There is nothing in the C++ standard that limits number of threads. However, OS will certainly have a hard limit.
Having too many threads decreases the throughput of your application, so it's recommended that you use a thread pool.

Simple multi-threading confusion for C++

I am developing a C++ application in Qt.
I have a very basic doubt, please forgive me if this is too stupid...
How many threads should I create to divide a task amongst them for minimum time?
I am asking this because my laptop is 3rd gen i5 processor (3210m). So since it is dual core & NO_OF_PROCESSORS environment variable is showing me 4. I had read in an article that dynamic memory for an application is only available for that processor which launched that application. So should I create only 1 thread (since env variable says 4 processors) or 2 threads (since my processor is dual core & env variable might be suggesting the no of cores) or 4 threads (if that article was wrong)?
Please forgive me since I am a beginner level programmer trying to learn Qt.
Thank You :)
Although hyperthreading is somewhat of a lie (you're told that you have 4 cores, but you really only have 2 cores, and another two that only run on what resources the former two don't use, if there's such a thing), the correct thing to do is still to use as many threads as NO_OF_PROCESSORS tells you.
Note that Intel isn't the only one lying to you, it's even worse on recent AMD processors where you have 6 alleged "real" cores, but in reality only 4 of them, with resources shared among them.
However, most of the time, it just more or less works out. Even in absence of explicitly blocking a thread (on a wait function or a blocking read), there's always a point where a core is stalled, for example in accessing memory due to a cache miss, which gives away resources that can be used by the hyperthreaded core.
Therefore, if you have a lot of work to do, and you can parallelize it nicely, you should really have as many workers as there are advertized cores (whether they're "real" or "hyper"). This way, you make maximum use of the available processor resources.
Ideally, one would create worker threads early at application startup, and have a task queue to hand tasks to workers. Since synchronization is often non-neglegible, the task queue should be rather "coarse". There is a tradeoff in maximum core usage and synchronization overhead.
For example, if you have 10 million elements in an array to process, you might push tasks that refer to 100,000 or 200,000 consecutive elements (you will not want to push 10 million tasks!). That way, you make sure that no cores stay idle on the average (if one finishes earlier, it pulls another task instead of doing nothing) and you only have a hundred or so synchronizations, the overhead of which is more or less neglegible.
If tasks involve file/socket reads or other things that can block for indefinite time, spawning another 1-2 threads is often no mistake (takes a bit of experimentation).
This totally depends on your workload, if you have a workload which is very cpu intensive you should stay closer to the number of threads your cpu has(4 in your case - 2 core * 2 for hyperthreading). A small oversubscription might be also be ok, as that can compensate for times where one of your threads waits for a lock or something else.
On the other side, if your application is not cpu dependent and is mostly waiting, you can even create more threads than your cpu count. You should however notice that thread creation can be quite an overhead. The only solution is to measure were your bottleneck is and optimize in that direction.
Also note that if you are using c++11 you can use std::thread::hardware_concurrency to get a portable way to determine the number of cpu cores you have.
Concerning your question about dynamic memory, you must have misunderstood something there.Generally all threads you create can access the memory you created in your application. In addition, this has nothing to do with C++ and is out of the scope of the C++ standard.
NO_OF_PROCESSORS shows 4 because your CPU has Hyper-threading. Hyper-threading is the Intel trademark for tech that enables a single core to execute 2 threads of the same application more or less at the same time. It work as long as e.g. one thread is fetching data and the other one accessing the ALU. If both need the same resource and instructions can't be reordered, one thread will stall. This is the reason you see 4 cores, even though you have 2.
That dynamic memory is only available to one of the Cores is IMO not quite right, but register contents and sometimes cache content is. Everything that resides in the RAM should be available to all CPUs.
More threads than CPUs can help, depending on how you operating systems scheduler works / how you access data etc. To find that you'll have to benchmark your code. Everything else will just be guesswork.
Apart from that, if you're trying to learn Qt, this is maybe not the right thing to worry about...
Edit:
Answering your question: We can't really tell you how much slower/faster your program will run if you increase the number of threads. Depending on what you are doing this will change. If you are e.g. waiting for responses from the network you could increase the number of threads much more. If your threads are all using the same hardware 4 threads might not perform better than 1. The best way is to simply benchmark your code.
In an ideal world, if you are 'just' crunching numbers should not make a difference if you have 4 or 8 threads running, the net time should be the same (neglecting time for context switches etc.) just the response time will differ. The thing is that nothing is ideal, we have caches, your CPUs all access the same memory over the same bus, so in the end they compete for access to resources. Then you also have an operating system that might or might not schedule a thread/process at a given time.
You also asked for an Explanation of synchronization overhead: If all your threads access the same data structures, you will have to do some locking etc. so that no thread accesses the data in an invalid state while it is being updated.
Assume you have two threads, both doing the same thing:
int sum = 0; // global variable
thread() {
int i = sum;
i += 1;
sum = i;
}
If you start two threads doing this at the same time, you can not reliably predict the output: It might happen like this:
THREAD A : i = sum; // i = 0
i += 1; // i = 1
**context switch**
THREAD B : i = sum; // i = 0
i += 1; // i = 1
sum = i; // sum = 1
**context switch**
THREAD A : sum = i; // sum = 1
In the end sum is 1, not 2 even though you started the thread twice.
To avoid this you have to synchronize access to sum, the shared data. Normally you would do this by blocking access to sum as long as needed. Synchronization overhead is the time that threads would be waiting until the resource is unlocked again, doing nothing.
If you have discrete work packages for each thread and no shared resources you should have no synchronization overhead.
The easiest way to get started with dividing work among threads in Qt is to use the Qt Concurrent framework. Example: You have some operation that you want to perform on every item in a QList (pretty common).
void operation( ItemType & item )
{
// do work on item, changing it in place
}
QList<ItemType> seq; // populate your list
// apply operation to every member of seq
QFuture<void> future = QtConcurrent::map( seq, operation );
// if you want to wait until all operations are complete before you move on...
future.waitForFinished();
Qt handles the threading automatically...no need to worry about it. The QFuture documenation describes how you can handle the map completion asymmetrically with signals and slots if you need to do that.

How much memory does a thread consume when first created?

I understand that creating too many threads in an application isn't being what you might call a "good neighbour" to other running processes, since cpu and memory resources are consumed even if these threads are in an efficient sleeping state.
What I'm interested in is this: How much memory (win32 platform) is being consumed by a sleeping thread?
Theoretically, I'd assume somewhere in the region of 1mb (since this is the default stack size), but I'm pretty sure it's less than this, but I'm not sure why.
Any help on this will be appreciated.
(The reason I'm asking is that I'm considering introducing a thread-pool, and I'd like to understand how much memory I can save by creating a pool of 5 threads, compared to 20 manually created threads)
I have a server application which is heavy in thread usage, it uses a configurable thread pool which is set up by the customer, and in at least one site it has 1000+ threads, and when started up it uses only 50 MB. The reason is that Windows reserves 1MB for the stack (it maps its address space), but it is not necessarily allocated in the physical memory, only a smaller part of it. If the stack grows more than that a page fault is generated and more physical memory is allocated. I don't know what the initial allocation is, but I would assume it's equal to the page granularity of the system (usually 64 KB). Of course, the thread would also use a little more memory for other things when created (TLS, TSS, etc), but my guess for the total would be about 200 KB. And bear in mind that any memory that is not frequently used would be unloaded by the virtual memory manager.
Adding to Fabios comments:
Memory is your second concern, not your first. The purpose of a threadpool is usually to constrain the context switching overhead between threads that want to run concurrently, ideally to the number of CPU cores available.
A context switch is very expensive, often quoted at a few thousand to 10,000+ CPU cycles.
A little test on WinXP (32 bit) clocks in at about 15k private bytes per thread (999 threads created). This is the initial commited stack size, plus any other data managed by the OS.
If you're using Vista or Win2k8 just use the native Win32 threadpool API. Let it figure out the sizing. I'd also consider partitioning types of workloads e.g. CPU intensive vs. Disk I/O into different pools.
MSDN Threadpool API docs
http://msdn.microsoft.com/en-us/library/ms686766(VS.85).aspx
I think you'd have a hard time detecting any impact of making this kind of a change to working code - 20 threads down to 5. And then add on the added complexity (and overhead) of managing the thread pool. Maybe worth considering on an embedded system, but Win32?
And you can set the stack size to whatever you want.
This depends highly on the system:
But usually, each processes is independent. Usually the system scheduler makes sure that each processes gets equal access to the available processor. Thus a multi threaded application time is multiplexed between the available threads.
Memory allocated to a thread will affect the memory available to the processes but not the memory available to other processes. A good OS will page out unused stack space so it is not in physical memory. Though if your threads allocate enough memory while live you could cause thrashing as each processor's memory is paged to/from secondary device.
I doubt a sleeping thread has any (very little) impact on the system.
It is not using any CPU
Any memory it is using can be paged out to a secondary device.
I guess this can be measured quite easily.
Get the amount of resources used by the system before creating a thread
Create a thread with default system values (default heap size and others)
Get the amount of resources after creating a thread and make the difference (with step 1).
Note that some threads need to be specified different values than the default ones.
You can try and find an average memory use by creating various number of threads (step 2).
The memory allocated by the OS when creating a thread consists of threads local data: TCB TLS ...
From wikipedia: "Threads do not own resources except for a stack, a copy of the registers including the program counter, and thread-local storage (if any)."