When is the HeapCreate function used, or in what cases do you need multiple heaps?

The Windows API has a set of functions for heap creation and handling: HeapCreate, HeapAlloc, HeapDestroy, etc.
I wonder: what is the use for an additional heap in a program?
From a fragmentation point of view, you get external fragmentation where memory is not reused among heaps. So even if low-fragmentation heaps are used, there is still fragmentation.
Memory management of additional heaps is fairly low-level, so they are not easy to use.
In addition, an additional heap can probably be emulated with allocations from the default heap and your own management of that memory.
So what is the usage? Have you used it?

One use case might be a long-running complex process that does a lot of memory allocation and deallocation. If the user wants to interrupt the process, then an easy way to clean up the memory currently allocated might be to have everything on a private heap and then simply destroy the heap.
I have seen this technique used in an embedded system (which wasn't using Windows, so it didn't use those exact API functions). The custom memory allocator had a feature to "mark" a specific state of the heap and then "rewind" to that point if a process was aborted.
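A minimal sketch of that Win32 pattern (error handling mostly elided):

    // Sketch: all of a task's transient allocations go to a private heap,
    // so cancellation or completion cleans everything up with one call.
    #include <windows.h>

    void RunCancellableWork()
    {
        // 0 = default options, initial size, no maximum (growable).
        HANDLE heap = HeapCreate(0, 0, 0);
        if (!heap) return;

        void* buf = HeapAlloc(heap, HEAP_ZERO_MEMORY, 4096);
        // ... do work, allocate freely, no individual HeapFree needed ...
        (void)buf;

        // On completion or cancellation: one call reclaims everything.
        HeapDestroy(heap);
    }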

One reason that is only important in rare situations, but immensely important there: memory allocated by new/malloc isn't executable on modern Windows systems. Hence if you write for example a JIT, you will have to use HeapCreate with HEAP_CREATE_ENABLE_EXECUTE.
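For instance, a minimal sketch (x86/x64, error handling elided; the "mov eax, 42; ret" bytes are just a stand-in for real JIT output):

    // Sketch: an executable private heap for generated code.
    #include <windows.h>

    typedef int (*JitFunc)(void);

    int RunGeneratedCode()
    {
        HANDLE execHeap = HeapCreate(HEAP_CREATE_ENABLE_EXECUTE, 0, 0);
        if (!execHeap) return -1;

        unsigned char* code =
            (unsigned char*)HeapAlloc(execHeap, 0, 16);
        // Machine code for: mov eax, 42; ret
        code[0] = 0xB8; code[1] = 42; code[2] = 0; code[3] = 0; code[4] = 0;
        code[5] = 0xC3;

        int result = ((JitFunc)code)();  // would fault on a normal heap with DEP

        HeapDestroy(execHeap);
        return result;                   // 42
    }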

Use: Very, very, very rarely.
Usage:
I once worked on a project that used heap management as a crude garbage collector (no destructors). There was a section of the code that went off and did some work without regard to memory management (using a separate heap). Then, when it was done, we just destroyed that heap to reclaim all the memory.

One use is for fixed-size objects. If you need to do a lot of allocation/deallocation of objects that are all the same size (e.g. small message buffers), a private heap avoids fragmentation issues.

You might also dedicate a heap per thread - for locality of reference or to reduce locking (which is required when a heap is shared across threads).

One use case I see more often than not is in malware.
The malware would have a packed binary somewhere in its .rsrc section, allocate an executable private heap, and then run the code there. It's a very effective technique.

One usage not mentioned here is avoiding heap contention.
You can create a thread-local heap which is not thread-safe by passing the HEAP_NO_SERIALIZE flag to HeapCreate.
Since only one thread can access the heap, no locks are required and contention is alleviated.
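A minimal sketch of that pattern (error handling elided):

    // Sketch: one unsynchronized heap per thread to avoid lock contention.
    #include <windows.h>

    // Each thread creates its own heap with serialization disabled; since
    // no other thread ever touches it, no locking is needed.
    DWORD WINAPI Worker(LPVOID)
    {
        HANDLE myHeap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);

        for (int i = 0; i < 100000; ++i) {
            void* p = HeapAlloc(myHeap, 0, 64);  // no interlocked operations
            HeapFree(myHeap, 0, p);
        }

        HeapDestroy(myHeap);
        return 0;
    }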

Related

C++ PMR: which polymorphic memory resource supports releasing memory as needed?

My program is a daemon and runs for a long time. Only some of the time does it need a lot of memory.
I want to increase my program performance by increasing memory locality.
And PMR seems like a good tool for this purpose.
However, it seems that the memory resources provided by the standard do not return memory to the upstream resource when a lot of it is currently unused
(i.e. synchronized_pool_resource, unsynchronized_pool_resource, monotonic_buffer_resource).
I want my program to use less memory when the load is not high (kind of like calling malloc_trim when needed).
Is there a memory resource that will cache only a small amount of currently unused memory and return the rest upstream?
A memory resource can be written to do whatever you want. However, since what you've described (returning memory that is unused) is what the default allocator does (and is one of the main reasons to use it), there wasn't much point in adding more standard library memory resources that do this.
Most of the defined memory resources are all about not returning unused memory, because returning and reallocating memory is expensive. They provide different strategies for keeping that memory accessible, so that later allocation calls are as fast as possible. That is, their whole point is to avoid the cost of allocating memory from the heap.
So you'll have to write a resource with the functionality you're looking for.
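As an illustration, a hedged sketch of such a resource; the class name bounded_cache_resource and the kMaxCached limit are invented for the example, and a real one would want size-bucketed caching:

    // Sketch: cache a bounded number of freed blocks, return the rest upstream.
    #include <cstddef>
    #include <memory_resource>
    #include <vector>

    class bounded_cache_resource : public std::pmr::memory_resource {
        struct Block { void* p; std::size_t bytes, align; };
        static constexpr std::size_t kMaxCached = 64;  // tuning knob (assumption)
        std::pmr::memory_resource* upstream_;
        std::vector<Block> cache_;

    public:
        explicit bounded_cache_resource(
            std::pmr::memory_resource* up = std::pmr::get_default_resource())
            : upstream_(up) {}

        ~bounded_cache_resource() override {
            for (auto& b : cache_)
                upstream_->deallocate(b.p, b.bytes, b.align);
        }

    private:
        void* do_allocate(std::size_t bytes, std::size_t align) override {
            // Reuse a cached block on an exact match; otherwise go upstream.
            for (std::size_t i = 0; i < cache_.size(); ++i) {
                if (cache_[i].bytes == bytes && cache_[i].align == align) {
                    void* p = cache_[i].p;
                    cache_[i] = cache_.back();
                    cache_.pop_back();
                    return p;
                }
            }
            return upstream_->allocate(bytes, align);
        }

        void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
            if (cache_.size() < kMaxCached)
                cache_.push_back({p, bytes, align});     // keep a small cache
            else
                upstream_->deallocate(p, bytes, align);  // trim the rest
        }

        bool do_is_equal(const std::pmr::memory_resource& o) const noexcept override {
            return this == &o;
        }
    };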

Heap optimized for (but not limited to) single-threaded usage

I use a custom heap implementation in one of my projects. It consists of two major parts:
Fixed-size block heap, i.e. a heap that allocates blocks of one specific size only. It allocates larger memory blocks (either virtual-memory pages or from another heap) and then divides them into atomic allocation units.
It performs allocation/freeing fast (in O(1)), and there's no memory-usage overhead apart from what the external heap imposes.
Global general-purpose heap. It consists of buckets of the above (fixed-size) heaps. WRT the requested allocation size it chooses the appropriate bucket, and performs the allocation via it.
Since the whole application is (heavily) multi-threaded - the global heap locks the appropriate bucket during its operation.
Note: in contrast to traditional heaps, this heap requires the allocation size not only for allocation but also for freeing. This makes it possible to identify the appropriate bucket without searches or extra memory overhead (such as storing the block size in front of the allocated block). Though somewhat less convenient, this is OK in my case. Moreover, since the "bucket configuration" is known at compile time (implemented via C++ template voodoo), the appropriate bucket is determined at compile time.
So far everything looks (and works) good.
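For illustration, a minimal single-threaded sketch of the fixed-size-block part (names invented, no error handling, and none of the compile-time bucket machinery):

    // Sketch: carve one large chunk into fixed-size units and thread a free
    // list through the free units, giving O(1) allocate/free.
    #include <cstddef>
    #include <cstdlib>

    class FixedBlockHeap {
        char* chunk_;
        void* freeList_ = nullptr;  // singly linked list through free units
    public:
        FixedBlockHeap(std::size_t unitSize, std::size_t unitCount)
            : chunk_(static_cast<char*>(std::malloc(unitSize * unitCount))) {
            // unitSize must be at least sizeof(void*) to hold the list link.
            for (std::size_t i = 0; i < unitCount; ++i) {
                void* unit = chunk_ + i * unitSize;
                *static_cast<void**>(unit) = freeList_;
                freeList_ = unit;
            }
        }
        ~FixedBlockHeap() { std::free(chunk_); }

        void* allocate() {                       // O(1)
            void* p = freeList_;
            if (p) freeList_ = *static_cast<void**>(p);
            return p;                            // nullptr when exhausted
        }
        void free(void* p) {                     // O(1)
            *static_cast<void**>(p) = freeList_;
            freeList_ = p;
        }
    };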
Recently I worked on an algorithm that performs heap operations heavily and is naturally affected significantly by heap performance. Profiling revealed that its performance is considerably impacted by the locking. That is, the heap itself works very fast (a typical allocation involves just a few memory-dereferencing instructions), but since the whole application is multi-threaded, the appropriate bucket is protected by a critical section, which relies on interlocked instructions, which are much heavier.
In the meantime I've fixed this by giving this algorithm its own dedicated heap, which is not protected by a critical section. But this imposes several problems/restrictions at the code level, such as the need to pass the context information deep down the stack wherever the heap may be necessary. One may also use TLS to avoid this, but that may cause problems with re-entrance in my specific case.
This makes me wonder: Is there a known technique to optimize the heap for (but not limit to) single-threaded usage?
EDIT:
Special thanks to @Voo for suggesting checking out Google's tcmalloc.
It seems to work more or less like what I did (at least for small objects). But in addition they solve the exact issue I have by maintaining per-thread caching.
I too thought in this direction, but I thought about maintaining per-thread heaps. Then freeing a memory block allocated from a heap belonging to another thread is somewhat tricky: one should insert it into a sort of locked queue, that other thread should be notified, and it should free the pending allocations asynchronously. Asynchronous deallocation may cause problems: if that thread is busy for some reason (for instance performing aggressive calculations), no memory deallocation actually occurs. Plus, in a multi-threaded scenario the cost of deallocation is significantly higher.
OTOH the idea with caching seems much simpler, and more efficient. I'll try to work it out.
Thanks a lot.
P.S.:
Indeed, Google's tcmalloc is great. I believe it's implemented pretty much like what I did (at least the fixed-size part).
But, to be pedantic, there's one matter where my heap is superior. According to the docs, tcmalloc imposes an overhead of roughly 1% (asymptotically), whereas my overhead is 0.0061% (4/64K, to be exact).
:)
One thought is to maintain a memory allocator per-thread. Pre-assign fairly chunky blocks of memory to each allocator from a global memory pool. Design your algorithm to assign the chunky blocks from adjacent memory addresses (more on that later).
When the allocator for a given thread is low on memory, it requests more from the global memory pool. This operation requires a lock, but should occur far less frequently than in your current case. When the allocator for a given thread frees its last byte, return all of that allocator's memory to the global pool (assuming the thread has terminated).
This approach will tend to exhaust memory earlier than your current approach (memory can be reserved for one thread that never needs it). The extent to which that is an issue depends on the thread creation/lifetime/destruction profile of your app(s). You can mitigate it at the expense of additional complexity, e.g. by introducing a signal, raised when the allocator for a given thread is out of memory and the global pool is exhausted, that other allocators can respond to by freeing some memory.
An advantage of this scheme is that it will tend to eliminate false sharing, as memory for a given thread will tend to be allocated in contiguous address spaces.
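A hedged sketch of that scheme (the 1 MiB chunk size is arbitrary; alignment only, no per-block free, and the global pool is just malloc behind a mutex):

    // Sketch: each thread bump-allocates from its own chunky block and only
    // takes the global lock when the block runs out.
    #include <cstddef>
    #include <cstdlib>
    #include <mutex>

    static std::mutex g_poolLock;
    constexpr std::size_t kChunk = 1 << 20;   // 1 MiB per refill (assumption)

    static char* GlobalPoolGrab() {           // the rare, locked path
        std::lock_guard<std::mutex> lock(g_poolLock);
        return static_cast<char*>(std::malloc(kChunk));
    }

    void* ThreadLocalAlloc(std::size_t n) {
        thread_local char*       cur  = nullptr;
        thread_local std::size_t left = 0;
        n = (n + 15) & ~static_cast<std::size_t>(15);  // 16-byte alignment
        if (n > left) {                       // refill from the global pool;
            cur  = GlobalPoolGrab();          // leftover space is abandoned
            left = kChunk;                    // in this simplified sketch
        }
        void* p = cur;
        cur  += n;                            // lock-free bump allocation
        left -= n;
        return p;
    }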
On a side note, if you have not already read it, I suggest IBM's Inside Memory Management article for anyone implementing their own memory management.
UPDATE
If the goal is to have very fast memory allocation optimized for a multi-threaded environment (as opposed to learning how to do it yourself), have a look at alternate memory allocators. If the goal is learning, perhaps check out their source code.
Hoard
tcmalloc (thanks Voo)
It might be a good idea to read Jeff Bonwick's classic papers on the slab allocator and vmem. The original slab allocator sounds somewhat like what you're doing. Although not very multithread-friendly, it might give you some ideas.
The Slab Allocator: An Object-Caching Kernel Memory Allocator
Then he extended the concept with vmem, which will definitely give you some ideas, since it behaved very nicely in a multi-CPU environment.
Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources

Dealing with fragmentation in a memory pool?

Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?
I wanted to add my 2 cents only because no one else pointed out that, from your description, it sounds like you are implementing a standard heap allocator (i.e. what all of us already use every time we call malloc() or operator new).
A heap is exactly such an object: it goes to the virtual memory manager and asks for a large chunk of memory (what you call "a pool"). Then it has all kinds of algorithms for allocating and freeing chunks of various sizes in the most efficient way. Furthermore, many people have modified and optimized these algorithms over the years. For a long time Windows came with an option called the low-fragmentation heap (LFH), which you had to enable manually. Starting with Vista, the LFH is used for all heaps by default.
Heaps are not perfect, and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for "average" use. But if you have a requirement similar to the requirements for a regular heap (i.e. many objects of different sizes...), you should consider just using a heap and not reinventing it, because chances are your implementation will be inferior to what the OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1 KB, but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used the Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size, from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1)-complexity) allocation/freeing of these objects, but the catch is that (a) memory usage is always large and never goes down, even if we don't have a single object allocated, and (b) Boost Pool never frees the memory it uses (at least in the mode we are using it in), so we only use this for objects which don't stick around very long.
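A simplified stand-in for that bucketed arrangement (hand-rolled free lists instead of Boost Pool, 16-byte steps instead of 4, names invented):

    // Sketch: one pool per size class; freed blocks are retained for reuse,
    // which is exactly the trade-off described above.
    #include <cstddef>
    #include <vector>

    class SizeClassAllocator {
        static constexpr std::size_t kStep = 16, kMax = 1024;
        std::vector<void*> freeLists_[kMax / kStep];  // one list per class

        static std::size_t Bucket(std::size_t n) { return (n - 1) / kStep; }

    public:
        void* allocate(std::size_t n) {        // assumes 1 <= n <= kMax
            auto& fl = freeLists_[Bucket(n)];
            if (!fl.empty()) {                 // O(1) reuse
                void* p = fl.back();
                fl.pop_back();
                return p;
            }
            return ::operator new((Bucket(n) + 1) * kStep);
        }
        void deallocate(void* p, std::size_t n) {
            freeLists_[Bucket(n)].push_back(p);  // O(1); memory is retained
        }
    };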
So which aspect(s) of normal memory allocation are you willing to give up in your app?
Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place; if you allocate blocks in powers of 2, you have less of a chance of causing this kind of fragmentation. There are a couple of other ways around it, but if you ever reach this state you just OOM at that point, because there are no delicate ways of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers to your data (e.g. int **). Then you can rearrange memory beneath the program (thread-safe, I hope) and compact the allocations so that you can allocate new blocks and still keep the data from old blocks (once the system gets to this state, though, that becomes a heavy overhead, but it should seldom be done).
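A hedged sketch of that double-indirection idea (names invented; the pool's bookkeeping is elided):

    // Sketch: callers hold a handle (pointer to pointer), so the pool can
    // move the underlying block during compaction and just patch the slot.
    #include <cstddef>
    #include <cstring>

    using Handle = void**;  // user code always dereferences through the slot

    // Every Handle stays valid across compaction because it points at the
    // slot, not at the data itself.
    void RelocateBlock(Handle h, void* newLocation, std::size_t size) {
        std::memcpy(newLocation, *h, size);  // copy the payload
        *h = newLocation;                    // handles now see the new spot
    }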
There are also ways of "binning" memory so that you have contiguous pages for instance dedicate 1 page only to allocations of 512 and less, another for 1024 and less, etc... This makes it easier to make decisions about which bin to use and in the worst case you split from the next highest bin or merge from a lower bin which reduces the chance of fragmenting across multiple pages.
Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.
It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.
Write the pool to operate as a list of allocations; it can then be extended and destroyed as needed. This can reduce fragmentation.
And/or implement allocation-transfer (or move) support so you can compact active allocations. The object/holder may need to assist you, since the pool may not necessarily know how to transfer types itself. If the pool is used with a collection type, it is far easier to accomplish compacting/transfers.

Is memory allocation a system call?

Is memory allocation a system call? For example, malloc and new. Is the heap shared by different processes and managed by the OS? What about a private heap? If memory allocation in the heap is managed by the OS, how expensive is it?
I would also like to have some link to places where I can read more about this topic.
In general, malloc and new do not perform a system call at each invocation. However, they use a lower-level mechanism to allocate large pages of memory. On Windows, the lower mechanism is VirtualAlloc(). I believe on POSIX systems, this is somewhat equivalent to mmap(). Both of these perform a system call to allocate memory to the process at the OS level. Subsequent allocations will use smaller parts of those large pages without incurring a system call.
The heap is normally internal to a process and is not shared between processes. If you need sharing, most OSes have an API for allocating shared memory. A portable wrapper for these APIs is available in the Boost.Interprocess library.
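For example, with Boost.Interprocess (the segment name "demo_shm" is made up):

    // Sketch: create a named shared-memory object and map it into this process.
    #include <boost/interprocess/shared_memory_object.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <cstring>

    int main() {
        using namespace boost::interprocess;
        shared_memory_object shm(open_or_create, "demo_shm", read_write);
        shm.truncate(4096);                      // size the segment
        mapped_region region(shm, read_write);   // map it into our address space
        std::memcpy(region.get_address(), "hello", 6);  // visible to others
        // Another process: shared_memory_object shm(open_only, "demo_shm", read_write);
        return 0;
    }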
If you would like to learn more about memory allocation and relationship with the OS, you should take a look at a good book on operating systems. I always suggest Modern Operating Systems by Andrew S. Tanenbaum as it is very easy to read.
(Assuming an operating system with memory protection. Might not be the case e.g. in embedded devices.)
Is memory allocation a system call?
Not necessarily each allocation. The process needs to call the kernel if its heap is not large enough for the requested allocation already, but C libraries usually request larger chunks when they do so, with the aim to reduce the number of system calls.
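As a toy illustration of that batching (Win32 flavor; TinyMalloc, the 1 MiB region size, and the bump-pointer strategy are all invented for the example, and nothing is ever freed):

    // Sketch: one system-level call (VirtualAlloc) provides pages that many
    // small allocations are then carved from with no kernel transitions.
    #include <windows.h>
    #include <cstddef>

    static char*       g_base = nullptr;
    static std::size_t g_used = 0;
    constexpr std::size_t kRegion = 1 << 20;   // 1 MiB (assumption)

    void* TinyMalloc(std::size_t n)
    {
        if (!g_base) {
            // The only system call: reserve and commit a large region.
            g_base = static_cast<char*>(
                VirtualAlloc(nullptr, kRegion, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE));
        }
        n = (n + 15) & ~static_cast<std::size_t>(15);  // 16-byte alignment
        if (!g_base || g_used + n > kRegion) return nullptr;
        void* p = g_base + g_used;             // pure user-mode pointer bump
        g_used += n;
        return p;
    }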
Is the heap shared by different processes and managed by the OS. What about private heap?
The heap is not shared between processes. It's shared between threads though.
How expensive kernel memory allocation system calls are depends entirely on the OS. Since that's a very common thing, you can expect it to be efficient under normal circumstances. Things get complicated in low RAM situations.
See the layered memory management in Win32.
Memory is ultimately allocated from the system in pages. If there is space available in already-committed pages, the memory manager will satisfy the request without switching to kernel mode. The best thing about HeapAlloc is that it provides fine-grained control over the allocation, whereas VirtualAlloc rounds each allocation up to a whole page, which can result in excessive memory usage.
Basically, the default heap and private heaps are treated the same, except that the default heap's size is specified at link time. The default heap size is 1 MB and grows as required.
See this article for more details
You can also find more information in this thread
Memory-allocation functions and language constructs like malloc/free and new/delete are not system calls. malloc/free are part of the C/C++ library, and new/delete are part of the C++ runtime system. Calls to either can occasionally lead to system calls. Memory allocation in other languages is implemented in a similar way.
In general, memory management can't be implemented without involving the OS at all, because memory is one of the main system resources, and because of this, global memory management is done by the OS kernel. But since system calls are relatively expensive, people try to design languages and memory-allocation libraries in such a way as to minimize the number of system calls.
As far as I know, the heap is an intra-process entity. That means all memory allocation/deallocation requests are managed entirely by the process itself. The operating system knows only the heap's location and size, and services two types of requests from the intra-process memory-management system:
add memory page at virtual address X
release memory page from virtual address X
The local memory-management system requests these services when it decides it doesn't have enough memory in the heap's memory pool, or that it has too much.
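On Windows, a hedged illustration of those two services in terms of committing and decommitting pages (real heap managers differ in the details, and MEM_COMMIT assumes the address range was already reserved):

    #include <windows.h>

    void GrowHeapRegion(void* addressX, SIZE_T pageBytes)
    {
        // "add memory page at virtual address X"
        VirtualAlloc(addressX, pageBytes, MEM_COMMIT, PAGE_READWRITE);
    }

    void ShrinkHeapRegion(void* addressX, SIZE_T pageBytes)
    {
        // "release memory page from virtual address X"
        VirtualFree(addressX, pageBytes, MEM_DECOMMIT);
    }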
Despite the fact that memory allocation is usually designed to minimize the number of system calls, it is still about an order of magnitude more expensive than allocation on the stack. This is because the heap's allocation/deallocation algorithms are much more complex and expensive than the stack's.

Can Win32 "move" heap-allocated memory?

I have a .NET/native C++ application. Currently, the C++ code allocates memory on the default heap which persists for the life of the application. Basically, functions/commands are executed in the C++ which results in allocation/modification of the current persistent memory. I am investigating an approach for cancelling one of these functions/commands mid-execution. We have hundreds of these commands, and many are very complicated (legacy) code.
The brute-force approach that I am trying to avoid is modifying each and every command/function to check for the cancellation and do all the appropriate clean-up (freeing heap memory). I am investigating a multi-threaded approach in which an additional thread receives the cancellation request and terminates the command-execution thread. I would want all dynamic memory to be allocated on a "private heap" using HeapCreate() (Win32). This way, the private heap could be destroyed by the thread handling the cancellation request. However, if the command runs to completion, I need the dynamic memory to persist. In this case, I would like to do the logical equivalent of "moving" the private heap memory to the default/process heap without incurring the cost of an actual copy. Is this in any way possible? Does this even make sense?
Alternatively, I recognize that I could just have a new private heap for every command/function execution (each will be a new thread). The private heap could be destroyed if the command is cancelled, or it would survive if the command completes. Is there any problem with the number of heaps growing indefinitely? I know there is some overhead involved with each heap. What limitations might I run into?
I am running on Windows 7 64-bit with 8GB RAM (consider this the target platform). The application I am working with is about 1 million SLOC (half C++, half C#). I am looking for any experience/suggestions with private heap management, or just alternatives to my solution.
You might be better off with separate processes instead of separate threads:
use memory-mapped files (i.e. not a file at all, just cross-process shared memory)
killing a process is 'cleaner' than killing a thread
I think you can have the shared memory 'survive' the killing without a move - you map/unmap instead of move
although you might need to do some memory management on your own.
Anyhow, it's worth looking into. I was looking into using inter-process memory for a few other things, and it had some unusual properties (I can't recall all of it clearly; it was a while ago), and you might be able to take advantage of it.
Just an idea!
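A hedged sketch of what that could look like with pagefile-backed file mappings (the section name Local\\CmdScratch is made up):

    // Sketch: shared memory created in the parent; the worker process maps
    // it, and killing the worker leaves the parent's view (and data) intact.
    #include <windows.h>

    HANDLE CreateSharedRegion(const wchar_t* name, DWORD bytes)
    {
        // INVALID_HANDLE_VALUE: no file on disk, backed by the pagefile.
        return CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                  PAGE_READWRITE, 0, bytes, name);
    }

    void* MapSharedRegion(HANDLE mapping)
    {
        return MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    }
    // Parent: h = CreateSharedRegion(L"Local\\CmdScratch", 1 << 20);
    //         the worker opens it with OpenFileMappingW and works there.
    // Cancel: TerminateProcess(worker); the parent's view still holds
    //         whatever was written; UnmapViewOfFile releases it later.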
From MSDN's Heap Functions page:
"Memory allocated by HeapAlloc is not movable. The address returned by HeapAlloc is valid until the memory block is freed or reallocated; the memory block does not need to be locked."
Can you re-link the legacy apps against your own malloc() implementation? If so, you should be able to manage without modifying the rest of the code. Your custom malloc library can track allocated blocks by thread and have a "FreeAllByThreadId()" function which you call after killing the legacy function's thread. You could use private heaps inside the library.
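A hedged sketch of that tracking idea (only FreeAllByThreadId comes from the answer; the map-based bookkeeping, TrackedMalloc, and the use of the process heap rather than private heaps are illustrative simplifications):

    // Sketch: remember which thread made each allocation, then bulk-free
    // everything a killed thread left behind.
    #include <windows.h>
    #include <map>
    #include <mutex>
    #include <vector>

    static std::mutex g_lock;
    static std::map<DWORD, std::vector<void*>> g_byThread;  // tid -> blocks

    void* TrackedMalloc(size_t n)
    {
        void* p = HeapAlloc(GetProcessHeap(), 0, n);
        std::lock_guard<std::mutex> lock(g_lock);
        g_byThread[GetCurrentThreadId()].push_back(p);      // remember owner
        return p;
    }

    void FreeAllByThreadId(DWORD tid)  // called after killing the worker thread
    {
        std::lock_guard<std::mutex> lock(g_lock);
        auto it = g_byThread.find(tid);
        if (it == g_byThread.end()) return;
        for (void* p : it->second)
            HeapFree(GetProcessHeap(), 0, p);
        g_byThread.erase(it);
    }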
An alternative to private heaps might be doing your own allocation from memory-mapped files. See "Creating Named Shared Memory". You create the shared memory while initializing the alloc library for the legacy thread. On success, map it into the main thread so your C# can access it; on termination, close it and it is released to the system.
A heap is a sort of big chunk of memory. It is a user-level memory manager. A heap is created by lower-level system memory calls (e.g., sbrk on Linux and VirtualAlloc on Windows). Within a heap, you can then request or return small chunks of memory via malloc/new/free/delete. By default, a process has a single heap (unlike stacks, all threads share one heap). But you can have many heaps.
Is it possible to combine two heaps without copying? A heap is essentially a data structure that maintains a list of used and freed memory chunks, so a heap must have a sort of bookkeeping data called metadata. Of course, this metadata is per heap. AFAIK, no heap manager supports a merge operation on two heaps. I have reviewed the entire source code of the malloc implementation in Linux glibc (Doug Lea's implementation), and there is no such operation. Windows' Heap* functions are implemented in a similar way. So it is currently impossible to move or merge two separate heaps.
Is it possible to have many heaps? I don't think there should be a big problem with having many heaps. As I said before, a heap is just a data structure that keeps track of used/freed memory chunks, so there is some amount of overhead, but it's not that severe. When you look at a malloc implementation, there is malloc_state, which is the basic data structure for each heap. For example, you can create another heap with create_mspace (on Windows, HeapCreate), and then you will get a new malloc state. It's not that big. So, if this trade-off (some heap overhead vs. ease of implementation) is acceptable, you can go on.
If I were you, I'd try the way you describe. It makes sense to me, and having a lot of heap objects would not add much overhead.
Also, it should be noted that technically, moving memory regions is impossible: pointers that pointed into the moved memory region would become dangling pointers.
P.S. Your problem sounds like a transaction, specifically software transactional memory (STM). A typical STM implementation buffers pending memory writes, and then commits them to real system memory if the transaction had no conflict.
No. Memory cannot be moved between heaps.