Can Win32 "move" heap-allocated memory? - c++

I have a .NET/native C++ application. Currently, the C++ code allocates memory on the default heap, which persists for the life of the application. Basically, functions/commands are executed in the C++ code, which results in allocation/modification of the current persistent memory. I am investigating an approach for cancelling one of these functions/commands mid-execution. We have hundreds of these commands, and many are very complicated (legacy) code.
The brute-force approach that I am trying to avoid is modifying each and every command/function to check for the cancellation and do all the appropriate clean-up (freeing heap memory). I am investigating a multi-threaded approach in which an additional thread receives the cancellation request and terminates the command-execution thread. I would want all dynamic memory to be allocated on a "private heap" using HeapCreate() (Win32). This way, the private heap could be destroyed by the thread handling the cancellation request. However, if the command runs to completion, I need the dynamic memory to persist. In this case, I would like to do the logical equivalent of "moving" the private heap memory to the default/process heap without incurring the cost of an actual copy. Is this in any way possible? Does this even make sense?
Alternatively, I recognize that I could just have a new private heap for every command/function execution (each will be a new thread). The private heap could be destroyed if the command is cancelled, or it would survive if the command completes. Is there any problem with the number of heaps growing indefinitely? I know there is some overhead involved with each heap. What limitations might I run into?
I am running on Windows 7 64-bit with 8GB RAM (consider this the target platform). The application I am working with is about 1 million SLOC (half C++, half C#). I am looking for any experience/suggestions with private heap management, or just alternatives to my solution.

You might be better off with separate processes instead of separate threads:
use memory-mapped files (i.e., not backed by a file at all - just cross-process shared memory)
killing a process is 'cleaner' than killing a thread
I think you can have the shared memory 'survive' the killing without a move - you map/unmap instead of move
although you might need to do some memory management on your own.
Anyhow, worth looking into. I was looking into using inter-process memory for a few other things, and it had some unusual properties (I can't recall all of it clearly; it was a while ago), and you might be able to take advantage of it.
Just an idea!

From MSDN's Heap Functions page:
"Memory allocated by HeapAlloc is not movable. The address returned by HeapAlloc is valid until the memory block is freed or reallocated; the memory block does not need to be locked."
Can you re-link the legacy code against your own malloc() implementation? If so, you should be able to manage without modifying the rest of the code. Your custom malloc library can track allocated blocks by thread, and have a "FreeAllByThreadId()" function which you call after killing the legacy function's thread. You could use private heaps inside the library.
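A minimal sketch of that wrapper idea, using one Win32 private heap per thread (tracked_malloc is an illustrative name; a real version would also need a tracked_free that locates the owning heap, e.g. via a small header stored in front of each block):

#include <windows.h>
#include <map>
#include <mutex>

static std::map<DWORD, HANDLE> g_heaps;   // thread id -> private heap
static std::mutex g_heapsLock;

static HANDLE HeapForCurrentThread()
{
    DWORD tid = GetCurrentThreadId();
    std::lock_guard<std::mutex> lock(g_heapsLock);
    auto it = g_heaps.find(tid);
    if (it == g_heaps.end())
        it = g_heaps.emplace(tid, HeapCreate(0, 0, 0)).first;  // growable private heap
    return it->second;
}

void* tracked_malloc(size_t size)
{
    return HeapAlloc(HeapForCurrentThread(), 0, size);
}

void FreeAllByThreadId(DWORD tid)   // call after killing the legacy thread
{
    std::lock_guard<std::mutex> lock(g_heapsLock);
    auto it = g_heaps.find(tid);
    if (it != g_heaps.end()) {
        HeapDestroy(it->second);    // frees every block the thread allocated, at once
        g_heaps.erase(it);
    }
}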
An alternative to private heaps might be doing your own allocation from memory-mapped files. See "Creating Named Shared Memory." You create the shared memory while initializing the alloc library for the legacy thread. On success, map it into the main thread so your C# can access it; on termination, close it and it is released to the system.

A heap is a big chunk of memory managed by a user-level memory manager. A heap is created by lower-level system memory calls (e.g., sbrk on Linux and VirtualAlloc on Windows). Within a heap, you can then request or return small chunks of memory via malloc/new/free/delete. By default, a process has a single heap (unlike the stack: all threads share the heap). But you can have many heaps.
Is it possible to combine two heaps w/o copying? A heap is essentially a data structure that maintains a list of used and freed memory chunks, so it needs bookkeeping data, called metadata. Of course, this metadata is per heap. AFAIK, no heap manager supports a merge operation on two heaps. I have reviewed the entire source of the glibc malloc implementation (derived from Doug Lea's allocator), and there is no such operation. The Windows Heap* functions are implemented in a similar way. So, it is currently impossible to move or merge two separate heaps.
Is it possible to have many heaps? I don't think there is a big problem with having many heaps. As I said before, a heap is just a data structure that keeps track of used/freed memory chunks, so there is some amount of overhead, but it's not that severe. If you look at a malloc implementation, there is malloc_state, the basic data structure for each heap. For example, you can create another heap with create_mspace (in Windows, HeapCreate), and you will get a new malloc state. It's not that big. So, if this trade-off (some per-heap overhead vs. ease of implementation) is acceptable, you may go ahead.
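As a rough Win32 sketch of that many-heaps approach applied to the question (error handling omitted; the cancelled flag would be set by the watcher thread):

#include <windows.h>
#include <atomic>

void RunCommand(std::atomic<bool>& cancelled)
{
    HANDLE hCmdHeap = HeapCreate(0, 0, 0);    // one growable private heap per command
    void* p = HeapAlloc(hCmdHeap, 0, 1024);   // every allocation the command makes goes here
    // ... execute the command ...
    if (cancelled)
        HeapDestroy(hCmdHeap);                // frees everything the command allocated, at once
    // on success, simply keep hCmdHeap: the allocations stay valid for the life
    // of the process; they just live in a heap other than the default one
}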
If I were you, I'd try the approach you describe. It makes sense to me, and having a lot of heap objects would not add much overhead.
Also note that moving a memory region is technically impossible in general: any pointers into the moved region would become dangling pointers.
P.S. Your problem sounds like a transaction, specifically Software Transactional Memory (STM). A typical STM implementation buffers pending memory writes, and then commits them to real memory only if the transaction had no conflicts.

No. Memory cannot be moved between heaps.

Related

Shouldn't malloc be asynchronous?

Am I correct to assume that when a process calls malloc, there may be I/O involved (swapping caches out, etc.) to make memory available, which in turn implies it can block for a considerable time? Thus, shouldn't we have two versions of malloc in Linux: one, say fast_malloc, which is suitable for obtaining smaller chunks and guaranteed not to block (but may of course still fail with OUT_OF_MEMORY), and another, async_malloc, where we could ask for arbitrary-size space but require a callback?
Example: if I need a small chunk of memory to make room for an item in a linked list, I may prefer the traditional inline malloc, knowing the OS should be able to satisfy it 99.999% of the time or just fail. Another example: if I'm a DB server trying to allocate a sizable chunk to put indexes in, I may opt for async_malloc and deal with the "callback complexity".
The reason I brought this up is that I'm looking to create highly concurrent servers handling hundreds of thousands of web requests per second, and generally avoid threads for handling the requests. Put another way, any time I/O occurs I want it to be asynchronous (say, libevent based). Unfortunately I'm realizing most C APIs lack proper support for concurrent use. For example, the ubiquitous MySQL C library is entirely blocking, and that's just one library my servers use extensively. Again, I can always simulate non-blocking by offloading to another thread, but that's nowhere near as cheap as waiting for the result via a completion callback.
As kaylum said in a comment:
Calling malloc will not inherently cause more IO. Perhaps you are confusing use of the memory returned versus just allocating the memory to you. Just because you ask for 100MB does not mean that malloc will immediately trigger 100MB of swapping. That only happens when you access the memory.
If you want to protect against long delays for swapping, etc. during subsequent access to the allocated memory, you can call mlock on it in a separate thread (so your process isn't stalled waiting for mlock to complete). Once mlock has succeeded, the memory is physically instantiated and cannot be swapped out until munlock.
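A sketch of that suggestion with POSIX mlock (note that mlock is subject to RLIMIT_MEMLOCK and can fail, so a real version should check its return value):

#include <sys/mman.h>
#include <cstdlib>
#include <thread>

int main()
{
    const size_t size = 64 * 1024 * 1024;
    void* big = std::malloc(size);     // cheap: no physical pages are committed yet

    std::thread locker([=] {
        mlock(big, size);              // faulting-in and pinning happen on this
    });                                // thread, not on the main thread

    // ... main thread keeps servicing events while pages are faulted in ...

    locker.join();                     // memory is now resident
    munlock(big, size);
    std::free(big);
}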
Remember that a call to malloc() does not necessarily result in your program asking the OS for more memory. It's down to the C runtime's implementation of malloc().
With glibc, malloc() often merely returns (depending on how much you're asking for) a pointer to memory that the runtime has already obtained from the OS. Similarly, free() doesn't necessarily return memory to the OS. It's a lot faster that way. I think glibc's malloc() is thread-safe, too.
Interestingly this gives C, C++ (and everything built on top) the same sort of properties normally associated with languages like Java and C#. Arguably building a runtime like Java or C# on top of a runtime like glibc means that there's actually more work than necessary going on to manage memory... Unless they're not using malloc() or new at all.
There's various allocators out there, and you can link whichever one you want into your program regardless of what your normal C runtime provides. So even on platforms like *BSD (which are typically far more traditional in their memory allocation approach, asking the OS each and every time you call malloc() or new) you can pull off the same trick.
Put another way, anytime I/O occurs I want it to be asynchronous (say libevent based).
I have bad news for you. Any time you access memory you risk blocking for I/O.
malloc itself is quite unlikely to block, because the system calls it uses merely create an entry in a kernel data structure that says "map in some memory here when it's accessed". This means malloc will only block when it has to go down to the kernel to map more memory, and either the kernel is itself short of memory (so that it has to wait while allocating its own internal data structures, which can take quite a while) or you use mlockall. The actual committing of memory that can cause swapping doesn't happen until you touch the memory. And your own memory can be swapped out at any time (or your program text paged out), and you have pretty much no control over it.

Multithread Memory Profiling in C++

I am working on profiling the memory usage of multiple threads in my application. I would like to be able to track the maximum allocation/current allocation of any given thread that is running. In order to do so, I planned on interposing on mallocs/frees. During each call to malloc, I would update the allocation records for the particular thread in a static map that associated thread ids to their particular metadata record. I am currently having issues during process exit. I think the issue is that when all the destructors are called for cleanup, the static map and lock protecting it have to be destroyed. My interposed mallocs/frees, however, acquire the lock before updating the profiling metadata structures. Eventually, the lock is destroyed, but there are subsequent calls to malloc/free that result in an attempt to acquire the no longer existent lock, resulting in a segfault.
Another issue that I am concerned about is that there are internal calls to malloc generated within my interposed malloc to allocate entries in the map.
Any ideas on ways of approaching the problem of profiling memory usage on a per thread basis? Any suggestions on data structures to track the usage of each thread? Does the above approach seem reasonable or are there any other ways of approaching the problem?
If you store your "extra" data as part of the allocation itself (before is easier, but you could do it after too - just need a size somewhere), then you shouldn't need any locks at all. Just a tad more memory. Of course, you will need to use atomics to update any lists of items.
If you look at this answer:
Setting memory on a custom heap
and imagine that HeapAlloc and HeapFree are malloc and free respectively. Then add code to store which thread is being used for the allocation.
So, instead of using a map, you simply update a linked list (using atomics to prevent concurrent updates from racing). This does of course make it a little more difficult to take up-to-date measurements per thread; you'll have to scan the list of allocations.
Of course, this only works for DIRECT calls to malloc and free.
The same principle would be possible by "injecting" a replacement malloc/free function (built along the principles in the other post, but of course not using the original malloc to allocate the memory, and not using free to free the memory).
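A sketch of that header-per-allocation scheme (profiled_malloc is an illustrative name; unlinking on free is the hard part and is omitted here):

#include <atomic>
#include <cstdlib>
#include <pthread.h>

struct Header
{
    Header* next;      // intrusive list of live allocations
    pthread_t owner;   // thread that made the allocation
    size_t size;
};

static std::atomic<Header*> g_allocations{nullptr};

void* profiled_malloc(size_t size)
{
    Header* h = static_cast<Header*>(std::malloc(sizeof(Header) + size));
    if (!h)
        return nullptr;
    h->owner = pthread_self();
    h->size = size;
    h->next = g_allocations.load(std::memory_order_relaxed);
    while (!g_allocations.compare_exchange_weak(h->next, h)) { }  // lock-free push
    return h + 1;   // hand the caller the bytes just past the header
}

Per-thread totals come from walking the list and summing size for matching owner. Note that nothing here has a destructor that runs at static destruction time, which sidesteps the shutdown-order segfault described in the question.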
This is a complicated thing to do and make work for all cases. There are many issues that you'll miss and only ever find through trial and error. I should know, I've been responsible for building a tool that does what you are trying to do. We've been doing this since 1999, available commercially since 2002.
If you are using Windows, C++ Memory Validator can give you per-thread profiling statistics.
http://www.softwareverify.com/cpp-memory.php.
The Objects tab and Sizes tab both have Threads sub-tabs which allow you to view data per thread. You can also run advanced queries on the Analysis tab that will allow you to view data on a per-thread basis.
Spend your time on your job, not writing tools.

Concurrent dynamic memory management on an AVR32

I am developing the software system for an embedded system of a student satellite. Our code is a mix of C/C++, running on an AT32UC3A3256S 32-bit AVR microcontroller. We are running the FreeRTOS operating system on the hardware, which is working fine. We also have a need for a somewhat specialized memory management scheme due to the physical memory layout and the concept of operations for the mission.
I have been attempting to use a dynamic memory implementation called dlmalloc, largely due to the availability of mspaces, which allow dynamic memory allocation into contained and tracked sections. I have some code that wraps around dlmalloc in order to create mspaces in certain places in memory and tie allocations to those mspaces depending on the FreeRTOS task making the request. The end product is a memory management system that tracks the amount of memory a given task has allocated and, if the task has gone over its imposed limit, resets the task and frees its memory.
I have created a test task that essentially is a big memory leak, continuously allocating memory without freeing it. The memory management system in place should periodically reset this task as it overflows its limit, freeing all memory that would otherwise be leaked. This works perfectly for a single task running, however fails in very odd ways if two similar copies of this task run simultaneously, leading me to believe that the memory allocation is not thread safe.
I have surrounded every call to memory allocation routines with FreeRTOS routines that ensure that only the task allocating memory will run for the duration of the memory request. To me this seems like it should provide the thread-safeness needed, but obviously something else is wrong.
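Roughly, the wrapping looks like this (a sketch; mspace_for_task stands in for my task-to-mspace lookup, and the header name for dlmalloc's MSPACES build may differ):

#include "FreeRTOS.h"
#include "task.h"
#include "dlmalloc.h"   // assumed header for a dlmalloc build with MSPACES=1

extern mspace mspace_for_task(TaskHandle_t task);   // lookup table, not shown

void* task_malloc(size_t size)
{
    vTaskSuspendAll();   // no other task can run while the scheduler is suspended
    void* p = mspace_malloc(mspace_for_task(xTaskGetCurrentTaskHandle()), size);
    xTaskResumeAll();
    return p;
}

void reset_task_memory(TaskHandle_t task)
{
    vTaskSuspendAll();
    destroy_mspace(mspace_for_task(task));   // releases the task's entire arena
    xTaskResumeAll();
}

(One caveat I am aware of: vTaskSuspendAll() does not mask interrupts, so any allocation made from an ISR would still race.)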
Does anybody have any ideas on what I might be missing to make this system thread-safe, on how to port dlmalloc to the hardware I am using, on any other concurrent memory allocators I could possibly use, or any advice at all? I can provide much more information if necessary but did not want to bloat the original post more than I already have.

When HeapCreate function is used or in what cases do you need a number of heaps?

The Windows API has a set of functions for heap creation and handling: HeapCreate, HeapAlloc, HeapDestroy, etc.
I wonder what is the use for another heap in a program?
From a fragmentation point of view, you get external fragmentation, since memory is not reused among heaps. So even if low-fragmentation heaps are used, there is still fragmentation.
Memory management of additional heaps seems to be low-level. So they are not easy to use.
In addition, an additional heap can probably be emulated by allocating a block from the default heap and managing that memory yourself.
So what is the usage? Did you use it?
One use case might be a long-running complex process that does a lot of memory allocation and deallocation. If the user wants to interrupt the process, then an easy way to clean up the memory currently allocated might be to have everything on a private heap and then simply destroy the heap.
I have seen this technique used in an embedded system (which wasn't using Windows, so it didn't use those exact API functions). The custom memory allocator had a feature to "mark" a specific state of the heap and then "rewind" to that point if a process was aborted.
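A minimal sketch of that mark/rewind idea, as a bump allocator (alignment handling omitted):

#include <cstddef>

struct Arena
{
    unsigned char buf[65536];
    size_t top = 0;

    void* alloc(size_t n)
    {
        if (top + n > sizeof(buf))
            return nullptr;                // arena exhausted
        void* p = buf + top;
        top += n;
        return p;
    }

    size_t mark() const { return top; }    // remember the current heap state
    void rewind(size_t m) { top = m; }     // abort: discard everything since mark()
};

On abort, rewind(savedMark) reclaims every allocation made after the mark in constant time.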
One reason that is only important in rare situations, but immensely important there: memory allocated by new/malloc isn't executable on modern Windows systems. Hence if you write, for example, a JIT, you will have to use HeapCreate with HEAP_CREATE_ENABLE_EXECUTE.
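A sketch of what that looks like (DEP is why execution from an ordinary heap faults):

#include <windows.h>

HANDLE jitHeap = HeapCreate(HEAP_CREATE_ENABLE_EXECUTE, 0, 0);
unsigned char* code = static_cast<unsigned char*>(HeapAlloc(jitHeap, 0, 4096));
// ... emit machine code into `code` ...
// FlushInstructionCache(GetCurrentProcess(), code, 4096);  // needed on some CPUs
// reinterpret_cast<void (*)()>(code)();                    // then jump into it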
Use: very, very rarely.
Usage:
I once worked on a project that used heap management as a crude garbage collector (no destructors). There was a section of the code that went off and did some work without regard to memory management (using a separate heap). Then, when it was done, we just destroyed that heap to reclaim all the memory.
One use is for fixed size objects. If you need to do a lot of allocation/deallocation of objects that are all the same size (i.e. small message buffers) a private heap avoids fragmentation issues.
You might also dedicate a heap per thread - for locality of reference or to reduce locking (which is required when a heap is shared across threads).
One use case I see more often than not is in malware.
The malware would have a packed binary somewhere in its .rsrc section, allocate an executable private heap, and then run the code there. It's a very effective technique.
One usage not mentioned here is to avoid heap contention.
You could create a thread-local heap which is not thread-safe, passing HEAP_NO_SERIALIZE flag to HeapCreate.
Since only one thread can access the heap, no locks are required and contention is alleviated.
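A sketch of that pattern (each thread lazily creates its own heap; destroying it at thread exit is left out):

#include <windows.h>

// safe only because no other thread ever touches this heap
thread_local HANDLE tlsHeap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);

void* fast_alloc(SIZE_T n)
{
    // HEAP_NO_SERIALIZE skips the heap's internal lock on every call
    return HeapAlloc(tlsHeap, HEAP_NO_SERIALIZE, n);
}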

Memory leak in c++

I am running my C++ application on an Intel XScale device. The problem is, when I run my application off-target (Ubuntu) with Valgrind, it does not show any memory leaks.
But when I run it on the target system, it starts with 50K free memory, and reduces to 2K overnight. How to catch this kind of leakage, which is not being shown by Valgrind?
A common culprit with these small embedded devices is memory fragmentation. You might have free memory in your application between two objects. A common solution to this is the use of a dedicated allocator (operator new in C++) for the most common classes. Memory pools used purely for objects of size N don't fragment - the space between two objects will always be a multiple of N.
It might not be an actual memory leak, but maybe a situation of increasing memory usage. For example it could be allocating a continually increasing string:
string s;
for (int i = 0; i < n; i++)
    s += "a";
50k isn't that much, maybe you should go over your source by hand and see what might be causing the issue.
This may be not a leak, but just the runtime heap not releasing memory to the operating system. This can also be fragmentation.
Possible ways to overcome this:
Split into two applications. The master application will have the simple logic, with little or no dynamic memory usage. It will start the worker application to actually do the work, in chunks sized so that the worker will not run out of memory, and will restart the worker periodically. This way memory is periodically returned to the operating system.
Write your own memory allocator. For example you can allocate a dedicated heap and only allocate memory from there, then free the dedicated heap entirely. This requires the operating system to support multiple heaps.
Also note that it's possible that your program runs differently on Ubuntu and on the target system and therefore different execution paths are taken and the code resulting in memory leaks is executed on the target system, but not on Ubuntu.
This does sound like fragmentation. Fragmentation is caused by allocating objects on the heap, say:
object1
object2
object3
object4
And then deleting some objects
object1
object3
object4
You now have a hole in the memory that is unused. If you allocate another object that's too big for the hole, the hole will remain wasted. Eventually, with enough memory churn, you can end up with so many holes that they waste your memory.
The way around this is to try and decide your memory requirements up front. If you've got particular objects that you know you are creating many of, try and ensure they're the same size.
You can use a pool to make the allocations more efficient for a particular class... or at least let you track it better so you can understand what's going on and come up with a good solution.
One way of doing this is to create a single static:
struct Slot
{
    Slot() : free(true) {}
    bool free;
    BYTE data[20]; // you'll need to tune the value 20 to what your program needs
};

Slot pool[500]; // you'll need to pick a good pool size too.
Create the pool up front when your program starts and pre-allocate it so that it is as big as the maximum requirement for your program. You may want to HeapAlloc it (or the equivalent in your OS) so that you can control when it appears during your application's startup.
Then override the new and delete operators for a suspect class so that they return slots from this pool. So, your objects will be stored in this pool.
You can override new and delete for classes of the same size to be put in the same pool.
Create pools of different sizes for different objects.
Just go for the worst offenders at first.
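A sketch of that override, building on the Slot pool above (Suspect is a stand-in for whichever class you target):

#include <cstddef>   // offsetof
#include <new>       // std::bad_alloc

class Suspect
{
public:
    void* operator new(size_t)
    {
        for (int i = 0; i < 500; i++)      // linear scan is fine for a sketch
            if (pool[i].free) {
                pool[i].free = false;
                return pool[i].data;
            }
        throw std::bad_alloc();            // pool exhausted
    }

    void operator delete(void* p)
    {
        Slot* s = reinterpret_cast<Slot*>(
            static_cast<unsigned char*>(p) - offsetof(Slot, data));
        s->free = true;                    // just mark the slot reusable
    }

    // data members must fit in Slot::data (20 bytes here)
};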
I've done something like this before and it solved my problem on an embedded device. I also was using a lot of STL, so I created a custom allocator (google for stl custom allocator - there are loads of links). This was useful for records stored in a mini-database my program used.
If your memory usage goes down, I don't think it can be defined as a memory leak.
Where are you getting reports of memory usage? The system might just have put most of your program's memory use in virtual memory.
All I can add is that Valgrind is known to be pretty efficient at finding memory leaks!
Also, are you sure that when you profiled your code, the code coverage was enough to cover all the code paths that might be executed on the target platform?
Valgrind for sure does not lie. As has been pointed out, this might indeed be the runtime heap not releasing the memory, but I would think otherwise.
Are you using any sophisticated technique to track the scope of objects?
If yes, then Valgrind is not smart enough, though you can try setting XScale-related options with Valgrind.
Most applications show a pattern of memory use like this:
they use very little when they start
as they create data structures they use more and more
as they start deleting old data structures or reusing existing ones, they reach a steady state where memory use stays roughly constant
If your app is continuously increasing in size, you may have a leak. If it increases in size over a period and then reaches a relatively steady state, you probably don't.
You can use the massif tool from Valgrind, which will show you where the most memory is allocated and how it evolves over time.
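For reference, a typical massif run looks like this (the output file is named after the process id):

valgrind --tool=massif ./your_app
ms_print massif.out.<pid>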