Can CreateThread interfere with VirtualAlloc usage? - c++

Is it possible for stack space allocated by CreateThread to interfere with the usage of VirtualAlloc? I can't find any discussion or documentation explaining precisely where stack space is allowed to be allocated...
The following more precisely illustrates my question:
uint8_t *baseA = (uint8_t*)VirtualAlloc(NULL, 1, MEM_RESERVE, PAGE_NOACCESS);
// Create a thread with the default stack size
HANDLE hThread = CreateThread(NULL, 0, SomeThreadProc, NULL, 0, NULL);
// Possibly create even more threads here.
// Can this ever fail in the absence of other allocators? It doesn't here...
uint8_t *baseB = (uint8_t*)VirtualAlloc(NULL, 1, MEM_RESERVE, PAGE_NOACCESS);
// Furthermore, in this test, baseB - baseA == 65536 (unless the debugger did something),
// so nothing appeared between baseA and baseB... not even enough space for the
// full 64kb of wastage, as baseA points to 4096 bytes by itself
If it does in fact use some analogue of VirtualAlloc, is there a way to change how Windows allocates stack space in a given process?

Stack space can be allocated anywhere in the address space of the process. There is no documentation on this today, and it is unlikely that such documentation will appear in the future.
You can safely assume that thread creation and VirtualAlloc are independent. If that were not the case, a lot of things would be broken: an allocator cannot hand out overlapping address ranges. That is unthinkable. The problem is somewhere else.
The only thing that might look like a correlation is the total amount of memory used and the resulting virtual address space fragmentation; in that case the later request will simply fail.
I have worked on memory analysis utilities, so what I write here is based on real experience.
(Figure: distribution of the number of virtual allocations per allocation size.)
(Figure: example address space contents of a 32-bit process; blue: committed, magenta: reserved, green: free memory.)
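If you want to inspect this yourself, here is a minimal sketch that walks the process address space with VirtualQuery and labels each region; thread stacks show up as ordinary reserved/committed regions (error handling omitted):
#include <windows.h>
#include <cstdio>

int main() {
    MEMORY_BASIC_INFORMATION mbi;
    unsigned char* addr = nullptr;
    // Walk every region from the bottom of the address space upward.
    while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        const char* state = mbi.State == MEM_COMMIT  ? "committed"
                          : mbi.State == MEM_RESERVE ? "reserved"
                                                     : "free";
        printf("%p - %p %s\n", mbi.BaseAddress,
               (unsigned char*)mbi.BaseAddress + mbi.RegionSize, state);
        addr = (unsigned char*)mbi.BaseAddress + mbi.RegionSize;
        if (addr == nullptr) break; // wrapped past the top of the address space
    }
    return 0;
}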

The Windows NT kernel treats memory allocation operations with high priority, and also in a thread-safe manner.
That means only one thread of a process can allocate memory at a time, which makes all allocations thread-safe (in theory).
There shouldn't be any interference between stack allocation and virtual allocation.
Also, you should keep in mind that you can reserve 1 GB of address space while your program still uses only its 2 MB of RAM.
That's because Windows "pre-allocates" virtual space, but it doesn't assign physical memory until you use it (write to it).
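A minimal sketch of that reserve/commit distinction (the 1 GB and 64 KB sizes are just for illustration):
#include <windows.h>
#include <cstring>

int main() {
    // Reserve 1 GB of address space: no physical memory or pagefile charge
    // yet, and any access to it would fault.
    void* base = VirtualAlloc(NULL, 1 << 30, MEM_RESERVE, PAGE_NOACCESS);

    // Commit just the first 64 KB: only now is backing storage promised.
    void* page = VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);

    // Physical pages are faulted in on first touch.
    memset(page, 0, 64 * 1024);

    // Release the entire reservation in one call.
    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}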
Actually, memory management is a lot more complicated, but for now you can be sure that no allocation operations should ever interfere, since Windows serializes your process's allocation requests, delaying all other threads' requests for as long as one allocation is being processed.
EDIT: That also means that allocation and deallocation are fairly expensive operations if you allocate millions of small bits. It's always better to allocate/deallocate larger memory areas because of this locking behavior.


De-initializing a region of memory

I have learned in the past few days about the issue of memory overcommitment (which is usually enabled by default), which basically means the following:
void* p = malloc(100);
The operating system gives you 100 contiguous (virtual) addresses taken from the (virtual) address space of your process, whose total range is OS-defined. Since that memory region has not been initialized yet, it doesn't count as occupied storage from a system-wide point of view, so it's a pure abstraction besides consuming your virtual addresses.
memset(p, 0, 5);
That uses the first 5 bytes, so from the point of view of the OS, your process now occupies 5 extra bytes, and the system has 5 bytes less of free storage. You still have 95 bytes of uninitialized storage.
The system only crashes or starts killing processes when the combined occupied (initialized) storage of every process goes beyond what the OS can hold.
If my understanding is right in this regard, is there a way to "de-initialize" a region of memory when you are done with it, in order to increase the system-wide free space, without losing the address region requested by malloc or aligned_malloc (so you don't increase fragmentation over time)?
The purpose of this question is more theoretical than practical: it is not about actually "freeing memory", but about freeing memory while conserving already-assigned virtual addresses.
Source on the difference between requesting virtual addresses and occupying storage: https://www.win.tue.nl/~aeb/linux/lk/lk-9.html#ss9.6
PS: Knowing it just for Linux would satisfy my curiosity.
No, there is no way.
On most systems, as soon as you allocate memory, it counts towards RAM or swap.
As your link shows, on Linux, you may need to access the memory once so that the memory actually gets allocated. But as soon as you do, the system must keep that memory available somewhere, in case you access it later.
The way to tell the system you are done with the memory is to actually free it.
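If you want to watch this behaviour on Linux, you can compare the VmSize and VmRSS lines of /proc/self/status around the allocation. A minimal sketch (the 256 MB size is arbitrary):
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Print the VmSize/VmRSS lines from /proc/self/status.
static void show(const char* label) {
    std::printf("-- %s --\n", label);
    std::FILE* f = std::fopen("/proc/self/status", "r");
    char line[256];
    while (std::fgets(line, sizeof(line), f))
        if (!std::strncmp(line, "VmSize", 6) || !std::strncmp(line, "VmRSS", 5))
            std::fputs(line, stdout);
    std::fclose(f);
}

int main() {
    show("start");
    std::size_t n = 256u << 20; // 256 MB
    char* p = (char*)std::malloc(n);
    show("after malloc");  // VmSize grows, VmRSS barely moves
    std::memset(p, 0, n);
    show("after memset");  // VmRSS grows too: the pages are now resident
    std::free(p);
    show("after free");
    return 0;
}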

Stack allocation for C++ green threads

I'm doing some research into C++ green threads, mostly boost::coroutine2 and the similar POSIX functions makecontext()/swapcontext(), and I'm planning to implement a C++ green thread library on top of boost::coroutine2. Both require the user code to allocate a stack for every new function/coroutine.
My target platform is x64/Linux. I want my green thread library to be suitable for general use, so the stacks should expand as required (a reasonable upper limit is fine, e.g. 10MB), it would be great if the stacks could shrink when too much memory is unused (not required). I haven't figured out an appropriate algorithm to allocate stacks.
After some googling, I figured out a few options myself:
use split stacks as implemented by the compiler (gcc -fsplit-stack); but split stacks have performance overhead, and Go has already moved away from them for that reason.
allocate a large chunk of memory with mmap() and hope the kernel is smart enough to leave the physical memory unallocated until the stacks are accessed. In this case, we are at the mercy of the kernel.
reserve a large memory space with mmap(PROT_NONE) and set up a SIGSEGV signal handler. In the signal handler, when the SIGSEGV is caused by stack access (the accessed memory is inside the large reserved space), allocate the needed memory with mmap(PROT_READ | PROT_WRITE); a sketch of this reserve-then-commit mechanism follows below. Here is the problem with this approach: mmap() isn't async-signal-safe, so it cannot be called inside a signal handler. It can still be implemented, although it is very tricky: create another thread during program startup for memory allocation, and use pipe() + read()/write() to send memory allocation requests from the signal handler to that thread.
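For reference, a minimal sketch of the reserve-then-commit mechanism that option 3 relies on, without the signal handler plumbing (the 10 MB bound comes from the question; the helper names are invented):
#include <sys/mman.h>
#include <cstddef>

const std::size_t kMaxStack = 10u << 20; // 10 MB upper limit per green thread

// Reserve address space only: PROT_NONE pages consume no physical memory,
// and any access to them faults with SIGSEGV.
char* reserve_stack() {
    void* p = mmap(nullptr, kMaxStack, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : (char*)p;
}

// Later, from the helper thread (since mmap/mprotect are not guaranteed to be
// async-signal-safe), commit pages on demand by making them read/write.
bool commit_page(char* page, std::size_t page_size) {
    return mprotect(page, page_size, PROT_READ | PROT_WRITE) == 0;
}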
A few more questions about option 3:
I'm not sure about the performance overhead of this approach: how well/badly do the kernel and CPU perform when the memory space is extremely fragmented due to thousands of mmap() calls?
Is this approach correct if the unallocated memory is accessed in kernel space, e.g. when read() is called?
Are there any other (better) options for stack allocation for green threads? How are green thread stacks allocated in other implementations, e.g. Go/Java?
The way that glibc allocates stacks for normal C programs is to mmap a region with the following mmap flag designed just for this purpose:
MAP_GROWSDOWN
Used for stacks. Indicates to the kernel virtual memory system
that the mapping should extend downward in memory.
For compatibility, you should probably use MAP_STACK too. Then you don't have to write the SIGSEGV handler yourself, and the stack grows automatically. The bounds can be set as described in: What does "ulimit -s unlimited" do?
If you want a bounded stack size, which is normally what people do for signal handlers if they want to call sigaltstack(2), just issue an ordinary mmap call.
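A minimal sketch of both variants (assuming Linux; error handling omitted, and the 10 MB bound is taken from the question):
#include <sys/mman.h>
#include <cstddef>

const std::size_t kStackSize = 10u << 20; // 10 MB

// Growable variant: the kernel may extend the mapping downward when a
// fault occurs just below it.
void* alloc_growable_stack() {
    return mmap(nullptr, kStackSize, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK | MAP_GROWSDOWN,
                -1, 0);
}

// Bounded variant: an ordinary mapping used top-down as a stack.
void* alloc_fixed_stack() {
    void* p = mmap(nullptr, kStackSize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    // Hand (char*)p + kStackSize to the coroutine as its initial stack
    // pointer, since stacks grow downward on x64.
    return p;
}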
The Linux kernel always maps physical pages that back virtual pages, catching the page fault when a page is first accessed (perhaps not in real-time kernels but certainly in all other configurations). You can use the /proc/<pid>/pagemap interface (or this tool I wrote https://github.com/dwks/pagemap) to verify this if you are interested.
Why mmap? When you allocate with new (or malloc) the memory is untouched and definitely not mapped.
const size_t STACK_SIZE = 10 * 1024 * 1024;
char* p = new char[STACK_SIZE * numThreads];
p now has enough memory for the threads you want. When you need the memory, start accessing p + STACK_SIZE * i.

Stack memory not released

I have the following loop, which pops from a concurrent C++ queue of mine, based on the implementation here: https://juanchopanzacpp.wordpress.com/2013/02/26/concurrent-queue-c11/
while (!interrupted)
{
    pxData data = queue->pop();
    if (data.value == -1)
    {
        break; // exit loop on terminating condition
    }
    usleep(7000); // stub to simulate processing
}
I am looking at the memory history using System Monitor in CentOS7.
I'm trying to free up the memory taken up by the queue after reading each value from it. However, as the while loop runs, I don't see the memory usage going down. I've verified that the queue length does go down.
It does go down, however, when -1 is encountered and the loop exits (the program is still running). But I can't have this, because where usleep is, I want to do some intensive processing.
Question: Why doesn't the memory occupied by data get freed (according to System Monitor)? Isn't stack-allocated memory supposed to be freed when the variable goes out of scope?
The struct is defined as follows, and populated at the beginning of the program.
typedef struct pxData
{
    float value; // -1 value terminates the loop
    float x, y, z;
    std::complex<float> valueData[65536];
} pxData;
It's populated with ~10000 pxData, which roughly translates to 5 GB. The system only has ~8 GB.
So it's important that the memory is freed up for other processing in the system.
There are a few things at play here.
Virtual Memory
First, you need to understand that just because your program is "using" 5 GB of memory does not mean that there are only 3 GB of RAM left for other programs. Virtual memory means that those 5 GB might be only 1 GB of actual "resident" data, and the other 4 GB may actually be on disk rather than in RAM. So it's important to look at the "resident set size" rather than the "virtual size" when you're looking at your program. And note that if your system actually runs low on RAM, the OS may shrink the RSS of some programs by "paging out" some of their memory. So don't worry too much about "5 GB" appearing in the system monitor--worry if you have a real, concrete performance problem.
Heap Allocation
The second aspect is why your virtual size does not decrease as you remove items from the queue. We can guess that you put those elements into the queue by creating them with malloc or new one-by-one, then pushing them onto the back of the queue. This means that the first element you allocated will come out of the queue first. And that in turn means that when you have drained 90% of the queue, your memory allocation might look like this:
[program|------------------unused-------------------|pxData]
The problem here is that in the real world, just because you free or delete something does not mean the operating system instantly reclaims that memory. In fact, it may not be able to reclaim any unused spans unless they are at the "end" (i.e. most recently allocated). Since C++ does not have garbage collection and cannot move items around in memory without your consent, you end up with this big "hole" in your program's virtual memory. That hole would be used to satisfy future memory allocation requests, but if you haven't got any, it just sits there, until the queue is completely empty:
[program|------------------unused--------------------------]
Then the system is able to shrink your virtual address space back down:
[program]
Which brings you back to where you started.
Solutions
If you want to "fix" this, one option is to allocate your memory in "reverse", i.e. put the last items allocated into the front of the queue.
Another option is to allocate the elements for the queue via mmap, which is something that e.g. Linux will do automatically for allocations which are "large." You can change the threshold for this by calling mallopt(3) with M_MMAP_THRESHOLD and setting it to be a little bit smaller than your struct size. This makes the allocations independent of each other, so the OS can reclaim them individually. This technique can even be applied to existing programs without recompilation, so is often useful if you need to solve this problem in a program you cannot modify.
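A minimal sketch of that second option, assuming glibc on Linux (mallopt is glibc-specific; the threshold is chosen to sit just below sizeof(pxData), which is roughly 512 KB):
#include <malloc.h> // mallopt, M_MMAP_THRESHOLD (glibc)

int main() {
    // Allocations of ~512 KB and larger now go through mmap, so freeing one
    // queue element returns its pages to the OS independently of its neighbours.
    mallopt(M_MMAP_THRESHOLD, 512 * 1024);

    // ... build the queue and run the processing loop as before ...
    return 0;
}
For a binary you cannot rebuild, setting the MALLOC_MMAP_THRESHOLD_ environment variable has the same effect with glibc.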
A C++ implementation would call some operator delete to release dynamically allocated (using some operator new) memory. In several C++ standard libraries, new calls malloc and delete calls free.
(I am focusing with a Linux point of view, but the principles are similar on other OSes)
But while malloc (or ::operator new) sometimes asks the OS kernel for more memory via system calls that change the virtual address space, such as mmap(2), free (or ::operator delete) often simply marks the released memory zone as re-available to future calls to malloc (or to new).
So from the kernel's point of view (e.g. as seen through /proc/, see proc(5)), the virtual address space is not changing and the memory remains consumed, even though inside the application it is marked as "freed" and will be reused by some future allocation (by future calls to malloc or new).
And most C++ standard containers use heap data internally. In particular, your local (stack-allocated) std::map or std::vector (or std::deque) variable will call new & delete for its internal data.
BTW, I find your declaration quite strange. Unless every struct pxData has exactly 65536 used valueData slots, I would suggest using a std::vector, i.e.
std::vector<std::complex<float>> valueData;
and improving your code accordingly. You'll probably need some valueData.reserve(somesize); and/or valueData.resize(somesize); and/or valueData.push_back(somecomplexnumber); etc.
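For instance, a sketch of the revised struct, keeping the fields from the question:
#include <complex>
#include <vector>

struct pxData
{
    float value; // -1 value terminates the loop
    float x, y, z;
    std::vector<std::complex<float>> valueData; // sized per element as needed
};
Each element then carries only as many samples as it actually uses, and the vector's buffer is released as soon as the element is destroyed.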

Why is memory not reusable after allocating/deallocating a number of small objects?

While investigating a memory leak in one of our projects, I've run into a strange issue. Somehow, the memory allocated for objects (a vector of shared_ptr to objects, see below) is not fully reclaimed when the parent container goes out of scope, and cannot be reused except by small objects.
The minimal example: when the program starts, I can allocate a single contiguous block of 1.5 GB without problem. After I use the memory somewhat (by creating and destructing a number of small objects), I can no longer do the big block allocation.
Test program:
#include <iostream>
#include <memory>
#include <vector>

using namespace std;

class BigClass
{
private:
    double a[10000];
};

void TestMemory() {
    cout << "Performing TestMemory" << endl;
    vector<shared_ptr<BigClass>> list;
    for (int i = 0; i < 10000; i++) {
        shared_ptr<BigClass> p(new BigClass());
        list.push_back(p);
    }
}

void TestBigBlock() {
    cout << "Performing TestBigBlock" << endl;
    char* bigBlock = new char[1024 * 1024 * 1536];
    delete[] bigBlock;
}

int main() {
    TestBigBlock();
    TestMemory();
    TestBigBlock();
}
The problem also repeats when using plain pointers with new/delete or malloc/free in the loop, instead of shared_ptr.
The culprit seems to be that after TestMemory(), the application's virtual memory stays at 827125760 bytes (regardless of the number of times I call it). As a consequence, there's no free VM region big enough to hold 1.5 GB. But I'm not sure why, since I'm definitely freeing the memory I used. Is it some "performance optimization" the CRT does to minimize OS calls?
Environment is Windows 7 x64 + VS2012 + 32-bit app without LAA
Sorry for posting yet another answer since I am unable to comment; I believe many of the others are quite close to the answer really :-)
Anyway, the culprit is most likely address space fragmentation. I gather you are using Visual C++ on Windows.
The C / C++ runtime memory allocator (invoked by malloc or new) uses the Windows heap to allocate memory. The Windows heap manager has an optimization in which it will hold on to blocks under a certain size limit, in order to be able to reuse them if the application requests a block of similar size later. For larger blocks (I can't remember the exact value, but I guess it's around a megabyte) it will use VirtualAlloc outright.
Other long-running 32-bit applications with a pattern of many small allocations have this problem too; the one that made me aware of the issue is MATLAB - I was using the 'cell array' feature to basically allocate millions of 300-400 byte blocks, causing exactly this issue of address space fragmentation even after freeing them.
A workaround is to use the Windows heap functions (HeapCreate() etc.) to create a private heap, allocate your memory through that (passing a custom C++ allocator to your container classes as needed), and then destroy that heap when you want the memory back. This also has the happy side effect of being very fast compared to deleting a zillion blocks in a loop.
Re. "what is remaining in memory" to cause the issue in the first place: nothing is remaining 'in memory' per se; it's more a case of the freed blocks being marked as free but not coalesced. The heap manager has a table/map of the address space, and it won't allow you to allocate anything which would force it to consolidate the free space into one contiguous block (presumably a performance heuristic).
There is absolutely no memory leak in your C++ program. The real culprit is memory fragmentation.
Just to be sure (regarding the memory leak point), I ran this program under Valgrind, and it did not report any memory leaks.
//Valgrind Report
mantosh#mantosh4u:~/practice$ valgrind ./basic
==3227== HEAP SUMMARY:
==3227== in use at exit: 0 bytes in 0 blocks
==3227== total heap usage: 20,017 allocs, 20,017 frees, 4,021,989,744 bytes allocated
==3227==
==3227== All heap blocks were freed -- no leaks are possible
Please find below my responses to the queries raised in the original question.
The culprit seems to be that after TestMemory(), the application's virtual memory stays at 827125760 (regardless of the number of times I call it).
Yes, the real culprit is the hidden fragmentation that occurs during the TestMemory() function. Just to explain fragmentation, I have taken this snippet from Wikipedia:
"...when free memory is separated into small blocks and is interspersed by allocated memory. It is a weakness of certain storage allocation algorithms, when they fail to order memory used by programs efficiently. The result is that, although free storage is available, it is effectively unusable because it is divided into pieces that are too small individually to satisfy the demands of the application.
For example, consider a situation wherein a program allocates 3 continuous blocks of memory and then frees the middle block. The memory allocator can use this free block of memory for future allocations. However, it cannot use this block if the memory to be allocated is larger in size than this free block."
The paragraph above explains memory fragmentation very nicely. Some allocation patterns (such as frequent allocation and deallocation) lead to memory fragmentation, but its end impact (i.e., the 1.5 GB allocation failing) varies greatly between systems, since every OS/heap manager has a different strategy and implementation.
As an example, your program ran perfectly fine on my machine (Linux), whereas you encountered the memory allocation failure.
Regarding your observation that the VM size remains constant: the VM size seen in the task manager is not directly proportional to our memory allocation calls. It mainly depends on how many bytes are in the committed state. When you allocate dynamic memory (using new/malloc) and do not write/initialize anything in those memory regions, they do not go into the committed state, and hence the VM size is not impacted. The VM size depends on many other factors and is a bit complicated, so we should not rely on it completely when reasoning about the dynamic memory allocation of our program.
As a consequence, there's no free VM region big enough to hold 1.5 GB.
Yes: due to fragmentation, there is no contiguous 1.5 GB region. Note that the total remaining (free) memory may well be more than 1.5 GB, just in a fragmented state, so there is no single big contiguous block.
But I'm not sure why, since I'm definitely freeing the memory I used. Is it some "performance optimization" the CRT does to minimize OS calls?
I have explained why this may happen even though you have freed all your memory. Now, in order to fulfil the user program's request, the OS calls into its virtual memory manager and tries to allocate the memory that the heap memory manager will use. But obtaining that additional memory depends on many other complex factors which are not easy to understand.
Possible Resolution of Memory Fragmentation
We should try to reuse memory allocations rather than allocating and freeing frequently. Certain patterns (such as allocations of particular request sizes in a particular order) can drive the overall memory into a fragmented state, and substantial design changes in your program may be needed to improve the situation. This is a complex topic, and understanding the complete root cause of such things requires internal knowledge of the memory manager.
However, tools do exist on Windows-based systems, though I am not very familiar with them. I found one excellent SO post about which tool (on Windows) can be used to check the fragmentation status of your program yourself:
https://stackoverflow.com/a/1684521/2724703
This is not a memory leak. The memory you used was allocated by the C/C++ runtime. The runtime requests a bulk of memory from the OS once, and each new you call then allocates from that bulk memory. When you delete an object, the runtime does not return the memory to the OS immediately; it may hold onto that memory for performance.
There is nothing here which indicates a genuine "leak". The pattern of memory you describe is not unexpected. Here are a few points which might help to understand. What happens is highly OS dependent.
A program often has a single heap which can be extended or shrunk in length. It is, however, one contiguous memory area, so changing the size just means changing where the end of the heap is. This makes it very difficult to ever "return" memory to the OS, since even one tiny object in that space will prevent it from shrinking. On Linux you can look up the function 'brk' (I know you're on Windows, but I presume it does something similar).
Large allocations are often done with a different strategy. Rather than putting them in the general purpose heap, an extra block of memory is created. When it is deleted this memory can actually be "returned" to the OS since its guaranteed nothing is using it.
Large blocks of unused memory don't tend to consume a lot of resources. If you generally aren't using the memory any more they might just get paged to disk. Don't presume that because some API function says you're using memory that you are actually consuming significant resources.
APIs don't always report what you think. Due to a variety of optimizations and strategies it may not actually be possible to determine how much memory is in use and/or available on a system at a particular moment. Unless you have intimate details of the OS you won't know for sure what those values mean.
The first two points can explain why a bunch of small blocks and one large block result in different memory patterns. The latter points indicate why this approach to detecting leaks is not useful. To detect genuine object-based "leaks" you generally need a dedicated profiling tool which tracks allocations.
For example, in the code provided:
1. TestBigBlock allocates and deletes its array; assume this uses a special memory block, so the memory is returned to the OS.
2. TestMemory extends the heap for all the small objects and never returns any heap to the OS. Here the heap is entirely available from the application's point of view, but from the OS's point of view it is still assigned to the application.
3. TestBigBlock now fails: although it would use a special memory block, it shares the overall address space with the heap, and there just isn't enough left after step 2 is complete.

What could be the reason for virtual bytes to grow 2x private bytes?

An application's virtual bytes grow to twice its private bytes.
Does this indicate a memory leak? Bad application design?
The OS is 32-bit.
Any thoughts are welcome.
The application is a stream database.
Fragmentation.
If you allocate the following chunks of memory:
16KB
8KB
16KB
and you then free the 8 KB chunk, your application will have 32 KB of private bytes, but 40 KB of virtual bytes, which is actually the highest virtual memory address that has ever been in use by your process (ignoring the other memory parts for the sake of simplicity).
Consider (if possible) using another memory manager. Some alternatives are:
The Windows Low-fragmentation heap (see http://msdn.microsoft.com/en-us/library/aa366750%28VS.85%29.aspx for more info)
The Doug Lea open-source memory manager
Commercial alternatives like Hoard
A fourth alternative is to write your own memory manager. It's not that easy, but if done right, it can have quite some benefits. Especially for certain niche or special applications, writing your own memory manager can be useful.
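As an illustration of that fourth alternative, here is a minimal sketch of a fixed-size free-list pool, about the simplest custom manager one can write (the class name and sizes are invented for the example):
#include <cstddef>
#include <vector>

// Fixed-size block pool: carves blocks out of large slabs and keeps an
// intrusive free list. Same-size allocations cannot fragment each other,
// and the destructor returns all slabs at once.
class FixedPool {
public:
    explicit FixedPool(std::size_t blockSize, std::size_t blocksPerSlab = 1024)
        : blockSize_(roundUp(blockSize)), blocksPerSlab_(blocksPerSlab) {}

    ~FixedPool() {
        for (char* slab : slabs_) delete[] slab;
    }

    void* allocate() {
        if (freeList_ == nullptr) addSlab();
        void* p = freeList_;
        freeList_ = *static_cast<void**>(freeList_); // pop the head block
        return p;
    }

    void deallocate(void* p) {
        *static_cast<void**>(p) = freeList_; // push the block back
        freeList_ = p;
    }

private:
    static std::size_t roundUp(std::size_t n) {
        const std::size_t a = sizeof(void*); // keep blocks pointer-aligned
        return n < a ? a : (n + a - 1) / a * a;
    }

    void addSlab() {
        char* slab = new char[blockSize_ * blocksPerSlab_];
        slabs_.push_back(slab);
        for (std::size_t i = 0; i < blocksPerSlab_; ++i)
            deallocate(slab + i * blockSize_);
    }

    std::size_t blockSize_;
    std::size_t blocksPerSlab_;
    void* freeList_ = nullptr;
    std::vector<char*> slabs_;
};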
An application's virtual bytes grow to twice its private bytes.
If the application only allocates heap memory, then to me this would be a sign that the application allocates lots of memory but never actually touches it. For example:
void *p = malloc( 16u<<20 );
would eat up 16 MB of virtual memory. But as long as the application doesn't perform any actions on the memory block, the OS won't even attempt to map the virtual memory to RAM. The simplest way to force the actual allocation of private memory is to memset() it:
void *p = malloc( 16u<<20 );
memset( p, 0, 16u<<20 );
Does this indicate a memory leak? Bad application design?
Or both. Or neither.
The longer variant of the response: unknown; it depends on what memory the application allocates, what other resources the application uses, the OS, the h/w platform, etc.
If unsure, use memory leak analysis tools to investigate, e.g. valgrind. Read up on SO for more information on memory leak analysis in C++.
Memory allocation has overhead to store management information about what was allocated. If you're allocating very small buffers the extra information can be a significant percentage of the total. That might be what you're seeing.
One possibility is if you set a large stack reserve size for your threads with linker option /STACK:reserve_bytes and then you start a lot of threads.
For example, if you have an ATL service, it automatically starts 4*numberOfCores apartment message dispatching threads by default. Compile and link such a service with /STACK:12000000 (12 megabytes), then run it on a 16-core server and it will start 64 threads, each with a 12MB stack, immediately consuming 768MB of virtual address space, although the actual committed memory may be much lower.
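If oversized reserves are the cause, one mitigation is to override the reserve for individual threads. A minimal sketch (the 256 KB figure is arbitrary):
#include <windows.h>

DWORD WINAPI Worker(LPVOID) { /* ... */ return 0; }

int main() {
    // With STACK_SIZE_PARAM_IS_A_RESERVATION, dwStackSize sets the address
    // space reserved for this thread's stack, overriding the linker's
    // /STACK value, rather than the initially committed size.
    HANDLE h = CreateThread(NULL, 256 * 1024, Worker, NULL,
                            STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return 0;
}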