How much memory does a thread consume when first created?

How much memory does a thread consume when first created? - c++

I understand that creating too many threads in an application isn't being what you might call a "good neighbour" to other running processes, since cpu and memory resources are consumed even if these threads are in an efficient sleeping state.
What I'm interested in is this: How much memory (win32 platform) is being consumed by a sleeping thread?
Theoretically, I'd assume somewhere in the region of 1mb (since this is the default stack size), but I'm pretty sure it's less than this, but I'm not sure why.
Any help on this will be appreciated.
(The reason I'm asking is that I'm considering introducing a thread-pool, and I'd like to understand how much memory I can save by creating a pool of 5 threads, compared to 20 manually created threads)

I have a server application which is heavy in thread usage, it uses a configurable thread pool which is set up by the customer, and in at least one site it has 1000+ threads, and when started up it uses only 50 MB. The reason is that Windows reserves 1MB for the stack (it maps its address space), but it is not necessarily allocated in the physical memory, only a smaller part of it. If the stack grows more than that a page fault is generated and more physical memory is allocated. I don't know what the initial allocation is, but I would assume it's equal to the page granularity of the system (usually 64 KB). Of course, the thread would also use a little more memory for other things when created (TLS, TSS, etc), but my guess for the total would be about 200 KB. And bear in mind that any memory that is not frequently used would be unloaded by the virtual memory manager.

Adding to Fabios comments:
Memory is your second concern, not your first. The purpose of a threadpool is usually to constrain the context switching overhead between threads that want to run concurrently, ideally to the number of CPU cores available.
A context switch is very expensive, often quoted at a few thousand to 10,000+ CPU cycles.
A little test on WinXP (32 bit) clocks in at about 15k private bytes per thread (999 threads created). This is the initial commited stack size, plus any other data managed by the OS.

If you're using Vista or Win2k8 just use the native Win32 threadpool API. Let it figure out the sizing. I'd also consider partitioning types of workloads e.g. CPU intensive vs. Disk I/O into different pools.
MSDN Threadpool API docs
http://msdn.microsoft.com/en-us/library/ms686766(VS.85).aspx

I think you'd have a hard time detecting any impact of making this kind of a change to working code - 20 threads down to 5. And then add on the added complexity (and overhead) of managing the thread pool. Maybe worth considering on an embedded system, but Win32?
And you can set the stack size to whatever you want.

This depends highly on the system:
But usually, each processes is independent. Usually the system scheduler makes sure that each processes gets equal access to the available processor. Thus a multi threaded application time is multiplexed between the available threads.
Memory allocated to a thread will affect the memory available to the processes but not the memory available to other processes. A good OS will page out unused stack space so it is not in physical memory. Though if your threads allocate enough memory while live you could cause thrashing as each processor's memory is paged to/from secondary device.
I doubt a sleeping thread has any (very little) impact on the system.
It is not using any CPU
Any memory it is using can be paged out to a secondary device.

I guess this can be measured quite easily.
Get the amount of resources used by the system before creating a thread
Create a thread with default system values (default heap size and others)
Get the amount of resources after creating a thread and make the difference (with step 1).
Note that some threads need to be specified different values than the default ones.
You can try and find an average memory use by creating various number of threads (step 2).
The memory allocated by the OS when creating a thread consists of threads local data: TCB TLS ...
From wikipedia: "Threads do not own resources except for a stack, a copy of the registers including the program counter, and thread-local storage (if any)."

Related

How to eager commit allocated memory in C++?

The General Situation
An application that is extremely intensive on both bandwidth, CPU usage, and GPU usage needs to transfer about 10-15GB per second from one GPU to another. It's using the DX11 API to access the GPU, so upload to the GPU can only happen with buffers that require mapping for each single upload. The upload happens in chunks of 25MB at a time, and 16 threads are writing buffers to mapped buffers concurrently. There's not much that can be done about any of this. The actual concurrency level of the writes should be lower, if it weren't for the following bug.
It's a beefy workstation with 3 Pascal GPUs, a high-end Haswell processor, and quad-channel RAM. Not much can be improved on the hardware. It's running a desktop edition of Windows 10.
The Actual Problem
Once I pass ~50% CPU load, something in MmPageFault() (inside the Windows kernel, called when accessing memory which has been mapped into your address space, but was not committed by the OS yet) breaks horribly, and the remaining 50% CPU load is being wasted on a spin-lock inside MmPageFault(). The CPU becomes 100% utilized, and the application performance completely degrades.
I must assume that this is due to the immense amount of memory which needs to be allocated to the process each second and which is also completely unmapped from the process every time the DX11 buffer is unmapped. Correspondingly, it's actually thousands of calls to MmPageFault() per second, happening sequentially as memcpy() is writing sequentially to the buffer. For each single uncommitted page encountered.
One the CPU load goes beyond 50%, the optimistic spin-lock in the Windows kernel protecting the page management completely degrades performance-wise.
Considerations
The buffer is allocated by the DX11 driver. Nothing can be tweaked about the allocation strategy. Use of a different memory API and especially re-use is not possible.
Calls to the DX11 API (mapping/unmapping the buffers) all happens from a single thread. The actual copy operations potentially happen multi-threaded across more threads than there are virtual processors in the system.
Reducing the memory bandwidth requirements is not possible. It's a real-time application. In fact, the hard limit is currently the PCIe 3.0 16x bandwidth of the primary GPU. If I could, I would already need to push further.
Avoiding multi-threaded copies is not possible, as there are independent producer-consumer queues which can't be merged trivially.
The spin-lock performance degradation appears to be so rare (because the use case is pushing it that far) that on Google, you won't find a single result for the name of the spin-lock function.
Upgrading to an API which gives more control over the mappings (Vulkan) is in progress, but it's not suitable as a short-term fix. Switching to a better OS kernel is currently not an option for the same reason.
Reducing the CPU load doesn't work either; there is too much work which needs to be done other than the (usually trivial and inexpensive) buffer copy.
The Question
What can be done?
I need to reduce the number of individual pagefaults significantly. I know the address and size of the buffer which has been mapped into my process, and I also know that the memory has not been committed yet.
How can I ensure that the memory is committed with the least amount of transactions possible?
Exotic flags for DX11 which would prevent de-allocation of the buffers after unmapping, Windows APIs to force commit in a single transaction, pretty much anything is welcome.
The current state
// In the processing threads
{
DX11DeferredContext->Map(..., &buffer)
std::memcpy(buffer, source, size);
DX11DeferredContext->Unmap(...);
}

Current workaround, simplified pseudo code:
// During startup
{
SetProcessWorkingSetSize(GetCurrentProcess(), 2*1024*1024*1024, -1);
}
// In the DX11 render loop thread
{
DX11context->Map(..., &resource)
VirtualLock(resource.pData, resource.size);
notify();
wait();
DX11context->Unmap(...);
}
// In the processing threads
{
wait();
std::memcpy(buffer, source, size);
signal();
}
VirtualLock() forces the kernel to back the specified address range with RAM immediately. The call to the complementing VirtualUnlock() function is optional, it happens implicitly (and at no extra cost) when the address range is unmapped from the process. (If called explicitly, it costs about 1/3rd of the locking cost.)
In order for VirtualLock() to work at all, SetProcessWorkingSetSize() needs to be called first, as the sum of all memory regions locked by VirtualLock() can not exceed the minimum working set size configured for the process. Setting the "minimum" working set size to something higher than the baseline memory footprint of your process has no side effects unless your system is actually potentially swapping, your process will still not consume more RAM than the actual working set size.
Just the use of VirtualLock(), albeit in individual threads and using deferred DX11 contexts for Map / Unmap calls, did instantly decrease the performance penalty from 40-50% to slightly more acceptable 15%.
Discarding the use of a deferred context, and exclusively triggering both all soft faults, as well as the corresponding de-allocation when unmapping on a single thread, gave the necessary performance boost. The total cost of that spin-lock is now down to <1% of the total CPU usage.
Summary?
When you expect soft faults on Windows, try what you can to keep them all in the same thread. Performing a parallel memcpy itself is unproblematic, in some situations even necessary to fully utilize the memory bandwidth. However, that is only if the memory is already committed to RAM yet. VirtualLock() is the most efficient way to ensure that.
(Unless you are working with an API like DirectX which maps memory into your process, you are unlikely to encounter uncommitted memory frequently. If you are just working with standard C++ new or malloc your memory is pooled and recycled inside your process anyway, so soft faults are rare.)
Just make sure to avoid any form of concurrent page faults when working with Windows.

Is dynamic memory access detrimental in real-time programs?

I'm working on a project that contains a real-time software component, using the RT PREEMPT patch on Linux.
During "idle" operation the software just sits waiting for incoming TCP connections and requests. Depending on the request, the software may create a real-time thread that runs for a period of time. So the entire application doesn't need to operate in real-time, only this one thread from time to time.
My question is this: I'm well aware that dynamic memory allocation is non-deterministic and is detrimental to real-time code. However, is accessing existing memory on the heap also detrimental to real-time constraints?
I ask because I'm considering a situation where the program starts up, allocates any required structures on the heap, then triggers a real-time thread that accesses the heap.
EDIT: Once the real-time thread has started, other threads are prevented from writing to variables the real-time thread needs to access using locks (well, except for one variable that must be updated, but access is still restricted using a separate lock).
EDIT2: I forgot to mention that the program will ultimately be deployed on a system that doesn't have any swap space, so I don't think the paging of memory should be an issue. (Though of course this doesn't avoid the issue of page-faults through memory that hasn't yet been provisioned by the OS.)

It is possible that the virtual memory manager might move your memory to swap making your thread generate a major page fault when it runs. You need to lock the memory pages using mlock(). I also recommend allocating memory in chunks and writing to all of it with memset() before using it to avoid minor page faults at run time and use placement new instead of the regular one to create your objects in the already allocated memory.

is accessing existing memory on the heap also detrimental to real-time constraints?
No, unless your system is thrashing.
BTW, you could consider writing your own allocation (e.g. above mmap(2)...) and using mlock(2) for the memory that ought to be in RAM.

What is the maximum number of threads a process can have in windows

In a windows process is there any limit for the threads to be used at a time. If so what is the maximum number of threads that can be used per process?

There is no limit that I know of, but there are two practical limits:
The virtual space for the stacks. For example in 32-bits the virtual space of the process is 4GB, but only about 2G are available for general use. By default each thread will reserve 1MB of stack space, so the top value are 2000 threads. Naturally you can change the size of the stack and make it lower so more threads will fit in (parameter dwStackSize in CreateThread or option /STACK in the linker command). If you use a 64-bits system this limit practically dissapears.
The scheduler overhead. Once you read the thousands of threads, just scheduling them will eat nearly 100% of your CPU time, so they are mostly useless anyway. This is not a hard limit, just your program will be slower and slower the more threads you create.

The actual limit is determined by the amount of available memory in various ways. There is no limit of "you can't have more than this many" of threads or processes in Windows, but there are limits to how much memory you can use within the system, and when that runs out, you can't create more threads.
See this blog by Mark Russinovich:
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx

How to guarantee that when a process calls malloc(), it will allocate physical memory immediately?

I am looking for a way to pre-allocate memory to a process (PHYSICAL memory), so that it will be absolutely guaranteed to be available to the C++ heap when I call new/malloc. I need this memory to be available to my process regardless of what other processes are trying to do with the system memory. In other words, I want to reserve physical memory to the C++ heap, so that it will be available immediately when I call malloc().
Here are the details:
I am developing a real-time system. The system is composed of several memory-hungry processes. Process A is the mission-critical process and it must survive and be immune to bad behavior of any other processes. It usually fits in 0.5 GB of memory, but it sometimes needs as much as 2.5 GB. The other processes attempt to use any amount of memory.
My concern is that the other processes may allocate lots of memory, exhausting the physical memory reserves in the system. Then, when Process A needs more memory FAST, it's not available, and the system will have to swap pages, which would take a long time.
It is critical that Process A get all the memory it needs without delay, whereas I'm fine with the other processes failing.
I'm running on Windows 7 64-bit.
Edit:
Would SetProcessWorkingSetSize work? Meaning: Would calling this for a big enough amount of memory protect my process A from any other process in the system.

VirtualLock is what you're looking for. It will force the OS to keep the pages in memory, as long as they're in the working set size (which is the function linked to by MK in his answer). However, there is no way to feed this memory to malloc/new- you'll have to implement your own memory allocator.

I think this question is weird because Windows 7 is not exactly the OS of choice for realtime applications. That said, there appears to be an interface that might help you:
AllocateUserPhysicalPages

How to write a C or C++ program to act as a memory and CPU cycle filler?

I want to test a program's memory management capabilities, for example (say, program name is director)
What happens if some other processes take up too much memory, and there is too less memory for director to run? How does director behave?
What happens if too many of the CPU cycles are used by some other program while director is running?
What happens if memory used by the other programs is freed after sometime? How does director claim the memory and start working at full capabilities.
etc.
I'll be doing these experiments on a Unix machine. One way is to limit the amount of memory available to the process using ulimit, but there is no good way to have control over the CPU cycle utilization.
I have another idea. What if I write some program in C or C++ that acts as a dynamic memory and CPU filler, i.e. does nothing useful but eats up memory and/or CPU cycles anyways?
I need some ideas on how such a program should be structured. I need to have dynamic(runtime) control over memory used and CPU used.
I think that creating a lot of threads would be a good way to clog up the CPU cycles. Is that right?
Is there a better approach that I can use?
Any ideas/suggestions/comments are welcome.

http://weather.ou.edu/~apw/projects/stress/
Stress is a deliberately simple workload generator for POSIX systems. It imposes a configurable amount of CPU, memory, I/O, and disk stress on the system. It is written in C, and is free software licensed under the GPLv2.
The functionality you seek overlaps the feature set of "test tools". So also check out http://ltp.sourceforge.net/tooltable.php.

If you have a single core this is enough to put stress on a CPU:
while ( true ) {
x++;
}
If you have lots of cores then you need a thread per core.
You make it variably hungry by adding a few sleeps.
As for memory, just allocate lots.

There are several problems with such a design:
In a virtual memory system, memory size is effectively unlimited. (Well, it's limited by your hard disk...) In practice, systems usually run out of address space well before they run out of backing store -- and address space is a per-process resource.
Any reasonable (non realtime) operating system is going to limit how much CPU time and memory your process can use relative to other processes.
It's already been done.
More importantly, I don't see why you would ever want to do this.

Dynamic memory control, you could just allocate or free buffers of a certain size to use or free more or less memory. As for CPU utilization, you will have to get an OS function to check this and periodically check it and see if you need to do useful work.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js