cannot allocate memory fast enough? - c++

Assume you are tasked to address a performance bottleneck in an application. Via profiling we discover the bottleneck is related to memory allocation. We discover that the application can only perform N memory allocations per second, no matter how many threads we have allocating memory. Why would we be seeing this behavior and how might we increase the rate at which the application can allocate memory. (Assume that we cannot change the size of the memory blocks that we are allocating. Assume that we cannot reduce the use of dynamically allocated memory.)

Okay, a few solutions exist - however almost all of them seem to be excluded via some constraint or another.
1. Have more threads allocate memory
We discover that the application can only perform N memory allocations per second, no matter how many threads we have allocating memory.
From this, we can cross-off any ideas of adding more threads (since "no matter how many threads"...).
2. Allocate more memory at a time
Assume that we cannot change the size of the memory blocks that we are allocating.
Fairly obviously, we have to allocate the same block size.
3. Use (some) static memory
Assume that we cannot reduce the use of dynamically allocated memory.
This one I found most interesting.. Reminded me of a story I heard about a FORTRAN programmer (before Fortran had dynamic memory allocation) whom just used a HUGE static array allocated on the stack as a private heap.
Unfortunately, this constraint prevents us from using such a trick.. However, it does give a glean into one aspect of a (the) solution.
My Solution
At the start of execution (either of the program, or on a per-thread basis) make several^ memory allocation system calls. Then use the memory from these later in the program (along with the existing dynamic memory allocations).
* Note: The 'several' would probably be an exact number, determined from your profiling, which the question mentions in the beginning.
TL;DR
The trick is to modify the timing of the memory allocations.

Looks like a challenging problem, though without details, you can only do some guesses. (Which is most likely the idea of this question)
The limitation here is the number of allocations, not the size of the allocation.
If we can assume that you are in control of where it allocations occur, you can allocate the memory for multiple instances at once. Please consider the code below as pseudo code, as it's only for illustration purpose.
const static size_t NR_COMBINED_ALLOCATIONS = 16;
auto memoryBuffer = malloc(size_of(MyClass)*NR_COMBINED_ALLOCATIONS);
size_t nextIndex = 0;
// Some looping code
auto myNewClass = new(memoryBuffer[nextIndex++]) MyClass;
// Some code
myNewClass->~MyClass();
free(memoryBuffer);
Your code will most likely become a lot more complex, though you will most likely tackle this bottleneck. In case you have to return this new class, you even need even more code just to do memory management.
Given this information, you can write your own implementation of allocators for your STL, override the 'new' and 'delete' operators ...
If that would not be enough, try challenging the limitations. Why can you only do a fixed number of allocations, is this because of unique locking? If so, can we improve this? Why do you need that many allocations, would changing the algorithm that is being used fix this issue ...

... the application can only perform N memory allocations per second,
no matter how many threads we have allocating memory. Why would we be
seeing this behavior and how might we increase the rate at which the
application can allocate memory.
IMHO, the most likely cause is that the allocations are coming from a common system pool.
Because they share a pool, each thread has to gain access thru some critical section blocking mechanism (perhaps a semaphore).
The more threads competing for dynamic memory (i.e. using new) will cause more critical section blocking.
The context switch between tasks is the time waste here.
How increase the rate?
option 1 - serialize the usage ... and this means, of course, that you can not simply try to use a semaphore at another level. For one system I worked on, a high dynamic memory utilization happened during system start up. In that case, it was easiest to change the start up such that thread n+1 (of this collection) only started after thread n had completed its initialization and fell into its wait-for-input loop. With only 1 thread doing its start up thing at a time, (and very few other dynamic memory users yet running) no critical section blockage occurred. 4 simultaneous start ups would take 30 seconds. 4 serialized startups finished in 5 seconds.
option 2 - provide a pool of ram and a private new/delete for each particular thread. If only one thread access a pool at a time, a critical section or semaphore is not needed. In an embedded system, the challenge here is allocate a reasonable amount of private pool for the thread and not too much waste. On a desktop with multi-gigabytes of ram, this is probably less of a problem.

I believe you could use a separate thread which could be responsible for memory allocation. This thread would have a queue containing a map of thread identifiers and needed memory allocation. Threads would not directly allocate memory, but rather send an allocation request to the queue and go into a wait state. The queue, on its turn would try to process each requested memory allocation from the queue and wake the corresponding sleeping thread up. When the thread responsible for memory handling can not process an allocation due to limitation, it should wait until memory can be allocated again.
One could build another layer into the solution as #Tersosauros's solution suggested to slightly optimize speed, but it should be based on something like the idea above nonetheless.

Related

Understanding Memory Pools

To my understanding, a memory pool is a block, or multiple blocks of memory allocate on the stack before runtime.
By contrast, to my understanding, dynamic memory is requested from the operating system and then allocated on the heap during run time.
// EDIT //
Memory pools are evidently not necessarily allocated on the stack ie. a memory pool can be used with dynamic memory.
Non dynamic memory is evidently also not necessarily allocated on the stack, as per the answer to this question.
The topics of 'dynamic vs. static memory' and 'memory pools' are thus not really related although the answer is still relevant.
From what I can tell, the purpose of a memory pool is to provide manual management of RAM, where the memory must be tracked and reused by the programmer.
This is theoretically advantageous for performance for a number of reasons:
Dynamic memory becomes fragmented over time
The CPU can parse static blocks of memory faster than dynamic blocks
When the programmer has control over memory, they can choose to free and rebuild data when it is best to do so, according the the specific program.
4. When multithreading, separate pools allow separate threads to operate independently without waiting for the shared heap (Davislor)
Is my understanding of memory pools correct? If so, why does it seem like memory pools are not used very often?
It seems this question is thwart with XY problem and premature optimisation.
You should focus on writing legible code, then using a profiler to perform optimisations if necessary.
Is my understanding of memory pools correct?
Not quite.
... on the stack ...
... on the heap ...
Storage duration is orthogonal to the concept of pools; pools can be allocated to have any of the four storage durations (they are: static, thread, automatic and dynamic storage duration).
The C++ standard doesn't require that any of these go into a stack or a heap; it might be useful to think of all of them as though they go into the same place... after all, they all (commonly) go onto silicon chips!
... allocate ... before runtime ...
What matters is that the allocation of multiple objects occurs before (or at least less often than) those objects are first used; this saves having to allocate each object separately. I assume this is what you meant by "before runtime". When choosing the size of the allocation, the closer you get to the total number of objects required at any given time the less waste from excessive allocation and the less waste from excessive resizing.
If your OS isn't prehistoric, however, the advantages of pools will quickly diminish. You'd probably see this if you used a profiler before and after conducting your optimisation!
Dynamic memory becomes fragmented over time
This may be true for a naive operating system such as Windows 1.0. However, in this day and age objects with allocated storage duration are commonly stored in virtual memory, which periodically gets written to, and read back from disk (this is called paging). As a consequence, fragmented memory can be defragmented and objects, functions and methods that are more commonly used might even end up being united into common pages.
That is, paging forms an implicit pool (and cache prediction) for you!
The CPU can parse static blocks of memory faster than dynamic blocks
While objects allocated with static storage duration commonly are located on the stack, that's not mandated by the C++ standard. It's entirely possible that a C++ implementation may exist where-by static blocks of memory are allocated on the heap, instead.
A cache hit on a dynamic object will be just as fast as a cache hit on a static object. It just so happens that the stack is commonly kept in cache; you should try programming without the stack some time, and you might find that the cache has more room for the heap!
BEFORE you optimise you should ALWAYS use a profiler to measure the most significant bottleneck! Then you should perform the optimisation, and then run the profiler again to make sure the optimisation was a success!
This is not a machine-independent process! You need to optimise per-implementation! An optimisation for one implementation is likely a pessimisation for another.
If so, why does it seem like memory pools are not used very often?
The virtual memory abstraction described above, in conjunction with eliminating guess-work using cache profilers virtually eliminates the usefulness of pools in all but the least-informed (i.e. use a profiler) scenarios.
A customized allocator can help performance since the default allocator is optimized for a specific use case, which is infrequently allocating large chunks of memory.
But let's say for example in a simulator or game, you may have a lot of stuff happening in one frame, allocating and freeing memory very frequently. In this case the default allocator is not as good.
A simple solution can be allocating a block of memory for all the throwaway stuff happening during a frame. This block of memory can be overwritten over and over again, and the deletion can be deferred to a later time. e.g: end of a game level or whatever.
Memory pools are used to implement custom allocators.
One commonly used is a linear allocator. It only keeps a pointer seperating allocated/free memory. Allocating with it is just a matter of incrementing the pointer by the N bytes requested, and returning it's previous value. And deallocation is done by resetting the pointer to the start of the pool.

Heap optimized for (but not limited to) single-threaded usage

I use a custom heap implementation in one of my projects. It consists of two major parts:
Fixed size-block heap. I.e. a heap that allocates blocks of a specific size only. It allocates larger memory blocks (either virtual memory pages or from another heap), and then divides them into atomic allocation units.
It performs allocation/freeing fast (in O(1)) and there's no memory usage overhead, not taking into account things imposed by the external heap.
Global general-purpose heap. It consists of buckets of the above (fixed-size) heaps. WRT the requested allocation size it chooses the appropriate bucket, and performs the allocation via it.
Since the whole application is (heavily) multi-threaded - the global heap locks the appropriate bucket during its operation.
Note: in contrast to the traditional heaps, this heap requires the allocation size not only for the allocation, but also for freeing. This allows to identify the appropriate bucket without searches or extra memory overhead (such as saving the block size preceding the allocated block). Though somewhat less convenient, this is ok in my case. Moreover, since the "bucket configuration" is known at compile-time (implemented via C++ template voodoo) - the appropriate bucket is determined at compile time.
So far everything looks (and works) good.
Recently I worked on an algorithm that performs heap operations heavily, and naturally affected significantly by the heap performance. Profiling revealed that its performance is considerably impacted by the locking. That is, the heap itself works very fast (typical allocation involves just a few memory dereferencing instructions), but since the whole application is multi-threaded - the appropriate bucket is protected by the critical section, which relies on interlocked instructions, which are much heavier.
I've fixed this meanwhile by giving this algorithm its own dedicated heap, which is not protected by a critical section. But this imposes several problems/restrictions at the code level. Such as the need to pass the context information deep within the stack wherever the heap may be necessary. One may also use TLS to avoid this, but this may cause some problems with re-entrance in my specific case.
This makes me wonder: Is there a known technique to optimize the heap for (but not limit to) single-threaded usage?
EDIT:
Special thanks to #Voo for suggesting checking out the google's tcmalloc.
It seems to work similar to what I did more-or-less (at least for small objects). But in addition they solve the exact issue I have, by maintaining per-thread caching.
I too thought in this direction, but I thought about maintaining per-thread heaps. Then freeing a memory block allocated from the heap belonging to another thread is somewhat tricky: one should insert it in a sort of a locked queue, and that other thread should be notified, and free the pending allocations asynchronously. Asynchronous deallocation may cause problems: if that thread is busy for some reason (for instance performs an aggressive calculations) - no memory deallocation actually occurs. Plus in multi-threaded scenario the cost of deallocation is significantly higher.
OTOH the idea with caching seems much simpler, and more efficient. I'll try to work it out.
Thanks a lot.
P.S.:
Indeed google's tcmalloc is great. I believe it's implemented pretty much similar to what I did (at least fixed-size part).
But, to be pedantic, there's one matter where my heap is superior. According to docs, tcmalloc imposes an overhead roughly 1% (asymptotically), whereas my overhead is 0.0061%. It's 4/64K to be exact.
:)
One thought is to maintain a memory allocator per-thread. Pre-assign fairly chunky blocks of memory to each allocator from a global memory pool. Design your algorithm to assign the chunky blocks from adjacent memory addresses (more on that later).
When the allocator for a given thread is low on memory, it requests more memory from the global memory pool. This operation requires a lock, but should occur far less frequently than in your current case. When the allocator for a given thread frees it's last byte, return all memory for that allocator to the global memory pool (assume thread is terminated).
This approach will tend to exhaust memory earlier than your current approach (memory can be reserved for one thread that never needs it). The extent to which that is an issue depends on the thread creation / lifetime / destruction profile of your app(s). You can mitigate that at the expense of additional complexity, e.g. by introducing a signal that a memory allocator for given thread is out of memory, and the global pool is exhaused, that other memory allocators can respond to by freeing some memory.
An advantage of this scheme is that it will tend to eliminate false sharing, as memory for a given thread will tend to be allocated in contiguous address spaces.
On a side note, if you have not already read it, I suggest IBM's Inside Memory Management article for anyone implementing their own memory management.
UPDATE
If the goal is to have very fast memory allocation optimized for a multi-threaded environment (as opposed to learning how to do it yourself), have a look at alternate memory allocators. If the goal is learning, perhaps check out their source code.
Hoarde
tcmalloc (thanks Voo)
It might be a good idea to read Jeff Bonwicks classic papers on the slab allocator and vmem. The original slab allocator sounds somewhat what you're doing. Although not very multithread friendly it might give you some ideas.
The Slab Allocator: An Object-Caching Kernel Memory Allocator
Then he extended the concept with VMEM, which will definitely give you some ideas since it had very nice behavior in a multi cpu environment.
Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources

Dealing with fragmentation in a memory pool?

Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?
I wanted to add my 2 cents only because no one else pointed out that from your description it sounds like you are implementing a standard heap allocator (i.e what all of us already use every time when we call malloc() or operator new).
A heap is exactly such an object, that goes to virtual memory manager and asks for large chunk of memory (what you call "a pool"). Then it has all kinds of different algorithms for dealing with most efficient way of allocating various size chunks and freeing them. Furthermore, many people have modified and optimized these algorithms over the years. For long time Windows came with an option called low-fragmentation heap (LFH) which you used to have to enable manually. Starting with Vista LFH is used for all heaps by default.
Heaps are not perfect and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for the "average" use. But if you have a requirement which is similar to the requirements for a regular heap (i.e. many objects, different size....) you should consider just using a heap and not reinventing it because chances are your implementation will be inferior to what OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1KB but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1) complexity) allocation/free of these objects but the catch is that a) memory usage is always large and never goes down even if we don't have a single object allocated, b) Boost Pool never frees the memory it uses (at least in the mode we are using it in) so we only use this for objects which don't stick around very long.
So which aspect(s) of normal memory allocation are you willing to give up in your app?
Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place, if you allocate blocks in powers of 2 you have less a chance of causing this kind of fragmentation. There are a couple of other ways around it but if you ever reach this state then you just OOM at that point because there are no delicate ways of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers of your data(ex: int **). Then you can rearrange memory beneath the program (thread safe I hope) and compact the allocations so that you can allocate new blocks and still keep the data from old blocks (once the system gets to this state though that becomes a heavy overhead but should seldom be done).
There are also ways of "binning" memory so that you have contiguous pages for instance dedicate 1 page only to allocations of 512 and less, another for 1024 and less, etc... This makes it easier to make decisions about which bin to use and in the worst case you split from the next highest bin or merge from a lower bin which reduces the chance of fragmenting across multiple pages.
Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.
It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.
write the pool to operate as a list of allocations, you can then extended and destroyed as needed. this can reduce fragmentation.
and/or implement allocation transfer (or move) support so you can compact active allocations. the object/holder may need to assist you, since the pool may not necessarily know how to transfer types itself. if the pool is used with a collection type, then it is far easier to accomplish compacting/transfers.

Can i allocate memory faster by using multiple threads?

If i make a loop that reserves 1kb integer arrays, int[1024], and i want it to allocate 10000 arrays, can i make it faster by running the memory allocations from multiple threads?
I want them to be in the heap.
Let's assume that i have a multi-core processor for the job.
I already did try this, but it decreased the performance. I'm just wondering, did I just make bad code or is there something that i didn't know about memory allocation?
Does the answer depend on the OS? please tell me how it works on different platforms if so.
Edit:
The integer array allocation loop was just a simplified example. Don't bother telling me how I can improve that.
It depends on many things, but primarily:
the OS
the implementation of malloc you are using
The OS is responsible for allocating the "virtual memory" that your process has access to and builds a translation table that maps the virtual memory back to actual memory addresses.
Now, the default implementation of malloc is generally conservative, and will simply have a giant lock around all this. This means that requests are processed serially, and the only thing that allocating from multiple threads instead of one does is slowing down the whole thing.
There are more clever allocation schemes, generally based upon pools, and they can be found in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high-performance in multi-threaded applications.
There is no silver bullet though, and at one point the OS must perform the virtual <=> real translation which requires some form of locking.
Your best bet is to allocate by arenas:
Allocate big chunks (arenas) at once
Split them up in arrays of the appropriate size
There is no need to parallelize the arena allocation, and you'll be better off asking for the biggest arenas you can (do bear in mind that allocation requests for a too large amount may fail), then you can parallelize the split.
tcmalloc and jemalloc may help a bit, however they are not designed for big allocations (which is unusual) and I do not know if it is possible to configure the size of the arenas they request.
The answer depends on the memory allocations routine, which are a combination of a C++ library layer operator new, probably wrapped around libC malloc(), which in turn occasionally calls an OS function such as sbreak(). The implementation and performance characteristics of all of these is unspecified, and may vary from compiler version to version, with compiler flags, different OS versions, different OSes etc.. If profiling shows it's slower, then that's the bottom line. You can try varying the number of threads, but what's probably happening is that the threads are all trying to obtain the same lock in order to modify the heap... the overheads involved with saying "ok, thread X gets the go ahead next" and "thread X here, I'm done" are simply wasting time. Another C++ environment might end up using atomic operations to avoid locking, which might or might not prove faster... no general rule.
If you want to complete faster, consider allocating one array of 10000*1024 ints, then using different parts of it (e.g. [0]..[1023], [1024]..[2047]...).
I think that perhaps you need to adjust your expectation from multi-threading.
The main advantage of multi-threading is that you can do tasks asynchronously, i.e. in parallel. In your case, when your main thread needs more memory it does not matter whether it is allocated by another thread - you still need to stop and wait for allocation to be accomplished, so there is no parallelism here. In addition, there is an overhead of a thread signaling when it is done and the other waiting for completion, which just can degrade the performance. Also, if you start a thread each time you need allocation this is a huge overhead. If not, you need some mechanism to pass the allocation request and response between threads, a kind of task queue which again is an overhead without gain.
Another approach could be that the allocating thread runs ahead and pre-allocates the memory that you will need. This can give you a real gain, but if you are doing pre-allocation, you might as well do it in the main thread which will be simpler. E.g. allocate 10M in one shot (or 10 times 1M, or as much contiguous memory as you can have) and have an array of 10,000 pointers pointing to it at 1024 offsets, representing your arrays. If you don't need to deallocate them independently of one another this seems to be much simpler and could be even more efficient than using multi-threading.
As for glibc it has arena's (see here), which has lock per arena.
You may also consider tcmalloc by google (stands for Thread-Caching malloc), which shows 30% boost performance for threaded application. We use it in our project. In debug mode it even can discover some incorrect usage of memory (e.g. new/free mismatch)
As far as I know all os have implicit mutex lock inside the dynamic allocating system call (malloc...). If you think a moment about that, if you do not lock this action you could run into terrible problems.
You could use the multithreading api threading building blocks http://threadingbuildingblocks.org/
which has a multithreading friendly scalable allocator.
But I think a better idea should be to allocate the whole memory once(should work quite fast) and split it up on your own. I think the tbb allocator does something similar.
Do something like
new int[1024*10000] and than assign the parts of 1024ints to your pointer array or what ever you use.
Do you understand?
Because the heap is shared per-process the heap will be locked for each allocation, so it can only be accessed serially by each thread. This could explain the decrease of performance when you do alloc from multiple threads like you are doing.
If the arrays belong together and will only be freed as a whole, you can just allocate an array of 10000*1024 ints, and then make your individual arrays point into it. Just remember that you cannot delete the small arrays, only the whole.
int *all_arrays = new int[1024 * 10000];
int *small_array123 = all_arrays + 1024 * 123;
Like this, you have small arrays when you replace the 123 with a number between 0 and 9999.
The answer depends on the operating system and runtime used, but in most cases, you cannot.
Generally, you will have two versions of the runtime: a multi-threaded version and a single-threaded version.
The single-threaded version is not thread-safe. Allocations made by two threads at the same time can blow your application up.
The multi-threaded version is thread-safe. However, as far as allocations go on most common implementations, this just means that calls to malloc are wrapped in a mutex. Only one thread can ever be in the malloc function at any given time, so attempting to speed up allocations with multiple threads will just result in a lock convoy.
It may be possible that there are operating systems that can safely handle parallel allocations within the same process, using minimal locking, which would allow you to decrease time spent allocating. Unfortunately, I don't know of any.

Can Win32 "move" heap-allocated memory?

I have a .NET/native C++ application. Currently, the C++ code allocates memory on the default heap which persists for the life of the application. Basically, functions/commands are executed in the C++ which results in allocation/modification of the current persistent memory. I am investigating an approach for cancelling one of these functions/commands mid-execution. We have hundreds of these commands, and many are very complicated (legacy) code.
The brute-force approach that I am trying to avoid is modifying each and every command/function to check for the cancellation and do all the appropriate clean-up (freeing heap memory). I am investigating a multi-threaded approach in which an additional thread receives the cancellation request and terminates the command-execution thread. I would want all dynamic memory to be allocated on a "private heap" using HeapCreate() (Win32). This way, the private heap could be destroyed by the thread handling the cancellation request. However, if the command runs to completion, I need the dynamic memory to persist. In this case, I would like to do the logical equivalent of "moving" the private heap memory to the default/process heap without incurring the cost of an actual copy. Is this in any way possible? Does this even make sense?
Alternatively, I recognize that I could just have a new private heap for every command/function execution (each will be a new thread). The private heap could be destroyed if the command is cancelled, or it would survive if the command completes. Is there any problem with the number of heaps growing indefinitely? I know there is some overhead involved with each heap. What limitations might I run into?
I am running on Windows 7 64-bit with 8GB RAM (consider this the target platform). The application I am working with is about 1 million SLOC (half C++, half C#). I am looking for any experience/suggestions with private heap management, or just alternatives to my solution.
You might be better off with separate processes instead of separate threads:
use memory mapped files (ie not a file at all - just cross-processed shared memory)
killing a process is 'cleaner' than killing a thread
I think you can have the shared memory 'survive' the killing without a move - you map/unmap instead of move
although you might need to do some memory management on your own.
Anyhow, worth looking into. I was looking into using inter-process memory for a few other things, and it had some unusual properties (can recall all of it clearly, it was a while ago), and you might be able to take advantage of it.
Just an idea!
From MSDN's Heap Functions page:
"Memory allocated by HeapAlloc is not movable. The address returned by HeapAlloc is valid until the memory block is freed or reallocated; the memory block does not need to be locked."
Can you re-link the legacy apps against your own malloc() implementation? If so, you should be able to manage without modifying the rest of the code. Your custom malloc library can track allocated blocks by thread, and have a "FreeAllByThreadId() function which you call after killing the legacy function's thread. You could use private heaps inside the library.
An alternative to private heaps might be doing your own allocation from memory-mapped files. See "Creating Named Shared Memory." You create the shared memory while initializing the alloc library for the legacy thread. On success, map it into the main thread so your c# can access it; on termination, close it and it is released to the system.
Heap is a sort of big chunk of memory. It is a user-level memory manager. A heap is created by lower-level system memory calls (e.g., sbrk in Linux and VirtualAlloc in Windows). In a a heap, then you can request or return a small chunk of memory by malloc/new/free/delete. By default, a process has a single heap (unlike stack, all threads share a heap). But, you can have many heaps.
Is it possible to combine two heaps w/o copying? A heap is essentially a data structure that maintains a list of used and freed memory chunks. So, a heap should have a sort of bookkeeping data called meta data. Of course, this meta data is per heap. AFAIK, no heap manager supports a merge operation of two heaps. I had reviewed entire source code of malloc implementation in Linux glibc (Doug Lea's implementation), but no such operation. Windows Heap* functions are also implemented in a similar way. So, it is currently impossible to move or merge two separate heaps.
Is it possible to have many heaps? I don't think there should be a big problem to have many heaps. As I said before, a heap is just a data structure that keeps used/freed memory chunks. So, there should be some amount of overhead. But, it's not that severe. When you look at one of malloc implementation, there is malloc_state, which is a basic data structure for each heap. For example, you can create another heap by create_mspace (in Windows, it is HeapCreate), then you will get a new malloc state. It's not that big. So, if this tread-off (some heap overhead vs. implementation easiness) is fine, then you may go on.
If I were you, I'll try the way you describe. It makes sense to me. Having a lot of heap objects would not make a big overhead.
Also, it should be noted that technically moving memory regions is impossible. Pointers that pointed the moved memory region will result in dangling pointers.
p.s. Your problem seems like a transaction, especially Software Transactional Memory. A typical implementation of STM buffers pending memory writes, and then commits to the real system memory it the transaction had no conflict.
No. Memory cannot be moved between heaps.