Smaller per allocation overhead in allocators library - c++

I'm currently writing a memory management library for c++ that is based around the concept of allocators. It's relatively simple, for now, all allocators implement these 2 member functions:
virtual void * alloc( std::size_t size ) = 0;
virtual void dealloc( void * ptr ) = 0;
As you can see, I do not support alignment in the interface but that's actually my next step :) and the reason why I'm asking this question.
I want the allocators to be responsible for alignment because each one can be specialized. For example, the block allocator can only return block-sized aligned memory so it can handle failure and return NULL if a different alignment is asked for.
Some of my allocators are in fact sub-allocators. For example, one of them is a linear/sequential allocator that just pointer-bumps on allocation. This allocator is constructed by passing in a char * pBegin and char * pEnd and it allocates from within that region in memory. For now, it works great but I get stuff that is 1-byte aligned. It works on x86 but I heard it can be disastrous on other CPUs (consoles?). It's also somewhat slower for reads and writes on x86.
The only sane way I know of implementing aligned memory management is to allocate an extra sizeof( void * ) + (alignement - 1) bytes and do pointer bit-masking to return the aligned address while keeping the original allocated address in the bytes before the user-data (the void * bytes, see above).
OK, my question...
That overhead, per allocation, seems big to me. For 4-bytes alignment, I would have 7 bytes of overhead on a 32-bit cpu and 11 bytes on a 64-bit one. That seems like a lot.
First, is it a lot? Am I on par with other memory management libs you might have used in the past or are currently using? I've looked into malloc and it seems to have a minimum of 16-byte overhead, is that right?
Do you know of a better way, smaller overhead, of returning aligned memory to my lib's users?

You could store an offset, rather than a pointer, which would only need to be large enough to store the largest supported alignment. A byte might even be sufficient if you only support smallish alignments.

How about you implement a buddy system which can be x-byte aligned depending on your requirement.
General Idea:
When your lib is initialized, allocate a big chunk of memory. For our example lets assume 16B. (Only this block needs to be aligned, the algorithm will not require you to align any other block)
Maintain lists for memory chunks of power 2. i.e 4B, 8B, 16B, ... 64KB, ... 1MB, 2MB, ... 512MB.
If a user asks for 8B of data, check the list for 8B, if not available, check list of 16B and split it into 2 blocks of 8B. Give one back to user and the other, gets appended to the list of 8B.
If the user asks for 16B, check if you have at least 2 8B available. If yes, combine them and give back to user. If not, the system does not have enough memory.
Pros:
No internal or external fragmentation.
No alignment required.
Fast access to memory chunks as they are pre-allocated.
If the list is an array, direct access to memory chunks of different size
Cons:
Overhead for list of memory.
If the list is a linked-list, traversal would be slow.

Related

Allocate small struct to 32 bit aligned in 64 bit system

The problem: I'm implementing a non-blocking data structure, where threads alter a shared pointer using a CAS operation. As pointers can be recycled, we have the ABA issue. To avoid this, I want to attach a version to each pointer. This is called a versioned pointer. A CAS128 is considered more expensive than a CAS64, so I'm trying to avoid going above 64 bits.
I'm trying to implement a versioned pointer. In a 32b system, the versioned pointer is a 64b struct, where the top 32 bits are the pointer and the bottom 32 is its version. This allows me to use CAS64 to atomically alter the pointer.
I'm having issues with a 64b system. In this case, I still want to use CAS64 instead of CAS128, so I'm trying to allocate a pointer aligned to 4GB (i.e., 32 zeros). I can then use masks to infer the pointer and version.
The solutions I've tried using alligned_malloc, padding, and std::align, but these involve allocating very large amounts of memory, e.g., alligned_malloc(1LL << 32, (1LL << 32)* sizeof(void*)) allocates 4GB of memory. Another solution is using a memory mapped file, but this involves synchronization that we're trying to avoid.
Is there a way to allocate 8B of memory aligned to 4GB that I'm missing?
First off, a non-portable solution that limits the code complexity creep to the point of allocation (see below for another approach that makes point of use more complicated, but should be portable); it only works on POSIX systems (not Windows), but you could reduce your overhead to the size of a page (not 8 bytes, but in the context of a 64 bit system, wasting 4088 bytes isn't too bad if you're not doing it too often; obviously, the nature of your problem means that you can't possibly waste more than sysconf(_SC_PAGESIZE) - 8 bytes per 4 GB, so that's not too bad) by the following mechanism:
mmap 4 GB of memory anonymously (not file-backed; pass fd of -1 and include the MAP_ANONYMOUS flag)
Compute the address of the 4 GB aligned pointer within that block
munmap the memory preceding that address, and the memory beginning sysconf(_SC_PAGE_SIZE) bytes after that address
This works because memory mappings aren't monolithic; they can be unmapped piecemeal, individual pages can be remapped without error, etc.
Note that if you're short on swap space, the brief request for 4 GB might cause problems (e.g. on a Linux system with heuristic overcommit disabled, it might fail to allocate the memory if it can't back it with swap, even though you never use most of it). You can experiment with passing MAP_NORESERVE to the original request, then performing the unmapping, then remapping that single page with MAP_FIXED (without MAP_NORESERVE) to ensure the allocation can be used without triggering a SIGSEGV at time of writing.
If you can't use POSIX mmap, should it really be impossible to use CAS128, you may want to consider a segmented memory model like the old x86 scheme for these pointers. You block allocate 4 GB segments (they don't need any special alignment) up front, and have your "pointers" be 32 bit offsets from the base address of the segment; you can't use the whole 64 bit address space (unless you allow for multiple selectors, possibly by repurposing part of the version number field for example; you can probably make do with a few million versions rather than four billion after all), but if you don't need to do so, this lets you have a base address that never changes after allocation (so no atomics needed), with offsets that fit within your desired 32 bit field. So instead of getting your data via:
data = *mystruct.pointer;
you have a segment pointer like this initialized early:
char *const base_address = new char[1ULL << 32]; // Or use smart pointer of your choosing
wrap it in a suballocator to partition the space, and now lookup is instead:
data = *reinterpret_cast<REAL_TYPE_HERE*>(&base_address[mystruct.pointer]);
I'm sure there are nifty ways to wrap this up better with custom allocators, custom operator news, what have you, but I've never had to do this in C++ (I've done similar magic in C, where there are no facilities to make it "pretty"), and I'd probably get it wrong, so I'll leave that as an exercise.

How much is there overhead per single object memory allocation? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by system (or std library?) to support memory management infrastructure? I believe there should be some. Otherwise, how the system would know how many bytes to dispose of when I call free(ptr).
UPDATE 1: It may sound like a 'too broad question' and, obviously, a C/C++ library specific, but what I am interested in, is a minimum extra memory needed to support a single allocation. Even not a system or implementation specific. For example, for binary tree, there is a must of 2 pointers - left and right children, and no way you can squeeze it.
UPDATE 2:
I decided to check it for myself on Windows 64.
#include <stdio.h>
#include <conio.h>
#include <windows.h>
#include <psapi.h>
void main(int argc, char *argv[])
{
int m = (argc > 1) ? atoi(argv[1]) : 1;
int n = (argc > 2) ? atoi(argv[2]) : 0;
for (int i = 0; i < n; i++)
malloc(m);
size_t peakKb(0);
PROCESS_MEMORY_COUNTERS pmc;
if ( GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)) )
peakKb = pmc.PeakWorkingSetSize >> 10;
printf("requested : %d kb, total: %d kb\n", (m*n) >> 10, peakKb);
_getch();
}
requested : 0 kb, total: 2080 kb
1 byte:
requested : 976 kb, total: 17788 kb
extra: 17788 - 2080 - 976 = 14732 (+1410%)
2 bytes:
requested : 1953 kb, total: 17784 kb
extra: 17784 - 2080 - 1953 = (+605% over)
4 bytes:
requested : 3906 kb, total: 17796 kb
extra: 17796 - 2080 - 3906 = 10810 (+177%)
8 bytes:
requested : 7812 kb, total: 17784 kb
extra: 17784 - 2080 - 7812 = (0%)
UPDATE 3: THIS IS THE ANSWER TO MY QUESTION I’VE BEEN LOOKING FOR: In addition to being slow, the genericity of the default C++ allocator makes it very space inefficient for small objects. The default allocator manages a pool of memory, and such management often requires some extra memory. Usually, the bookkeeping memory amounts to a few extra bytes (4 to 32) for each block allocated with new. If you allocate 1024-byte blocks, the per-block space overhead is insignificant (0.4% to 3%). If you allocate 8-byte objects, the per-object overhead becomes 50% to 400%, a figure big enough to make you worry if you allocate many such small objects.
For allocated objects, no additional metadata is theoretically required. A conforming implementation of malloc could round up all allocation requests to a fixed maximum object size, for example. So for malloc (25), you would actually receive a 256-byte buffer, and malloc (257) would fail and return a null pointer.
More realistically, some malloc implementations encode the allocation size in the pointer itself, either directly using bit patterns corresponding to specific fixed sized classes, or indirectly using a hash table or a multi-level trie. If I recall correctly, the internal malloc for Address Sanitizer is of this type. For such mallocs, at least part of the immediate allocation overhead does not come from the addition of metadata for heap management, but from rounding the allocation size up to a supported size class.
Other mallocs have a per-allocation header of a single word. (dlmalloc and its derivative are popular examples). The actual per-allocation overhead is usually slightly larger because due to the header word, you get weired supported allocation sizes (such as 24, 40, 56, … bytes with 16-byte alignment on a 64-bit system).
One thing to keep in mind is that many malloc implementations put a lot of data deallocated objects (which have not yet been returned to the operating system kernel), so that malloc (the function) can quickly find an unused memory region of the appropriate size. Particularly for dlmalloc-style allocators, this also provides a constraint on minimum object size. The use of deallocated objects for heap management contributes to malloc overhead, too, but its impact on individual allocations is difficult to quantify.
Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by system (or std library?) to support memory management infrastructure? I believe there should be some. Otherwise, how the system would know how many bytes to dispose of when I call free(ptr).
This is entirely library specific. The answer could be anything from zero to whatever. Your library could add data to the front of the block. Some add data to the front and back of the block to track overwrites. The amount of overhead added varies among libraries.
The length could be tracked within the library itself with a table. In that case, there may not be hidden field added to the allocated memory.
The library might only allocate blocks in fixed sizes. The amount you ask for gets rounded up to the next block size.
The pointer itself is essentially overhead and can be a dominant driver of memory use in some programs.
The theoretical minimum overhead, might be sizeof(void*) for some theoretical system and use, but that combination of CPU, Memory and usage pattern is so unlikely to exist as to be absolutely worthless for consideration. The standard requires that memory returned by malloc be suitably aligned for any data type, therefore there will always be some overhead; in the form of unused memory between the end of one allocated block, and the beginning of the next allocated block, except in the rare cases where all memory usage is sized to a multiple of the block size.
The minimum implementation of malloc/free/realloc, assumes the heap manager has one contiguous block of memory at it's disposal, located somewhere in system memory, the pointer that said heap manager uses to reference that original block, is overhead (again sizeof(void*)). One could imagine a highly contrived application that requested that entire block of memory, thus avoiding the need for additional tracking data. At this point we have 2 * sizeof(void*) worth of overhead, one internal to the heap manager, plus the returned pointer to the one allocated block (the theoretical minimum). Such a conforming heap manager is unlikely to exist as it must also have some means of allocating more than one block from its pool and that implies at a minimum, tracking which blocks within its pool are in use.
One scheme to avoid overhead involves using pointer sizes that are larger than the physical or logical memory available to the application. One can store some information in those unused bits, but they would also count as overhead if their number exceeds a processor word size. Generally, only a hand full of bits are used and those identify which of the memory managers internal pools the memory comes from. The later, implies additonal overhead of pointers to pools. This brings us to real world systems, where the heap manager implementation is tuned to the OS, hardware architecture and typical usage patterns.
Most hosted implementations (hosted == runs on OS) request one or more memory blocks from the operating system in the c-runtime initialization phase. OS memory management calls are expensive in time and space. The OS has it's own memory manager, with its own overhead set, driven by its own design criteria and usage patterns. So c-runtime heap managers attempt to limit the number of calls into the OS memory manager, in order to reduce the latency of the average call to malloc() and free(). Most request the first block from the OS when malloc is first called, but this usually happens at some point in the c-runtime initialization code. This first block is usually a low multiple of the system page size, which can be one or more orders of magnitude larger than the size requested in the initial malloc() call.
At this point it is obvious that heap manager overhead is extremely fluid and difficult to quantify. On a typical modern system, the heap manager must track multiple blocks of memory allocated from the OS, how many bytes are currently allocated to the application in each of those blocks and potentially, how much time has passed since a block went to zero. Then there's the overhead of tracking allocations from within each of those blocks.
Generally malloc rounds up to a minimum alignment boundary and often this is not special cased for small allocations as applications are expected to aggregate many of these into a single allocation. The minimum alignment is often based on the largest required alignment for a load instruction in the architecture the code is running on. So with 128-bit SIMD (e.g. SSE or NEON) the minimum is 16 bytes. In practice there is a header as well which causes the minimum cost in size to be doubled. As SIMD register widths have increased, malloc hasn't increased it's guaranteed alignment.
As was pointed out, the minimum possible overhead is 0. Though the pointer itself should probably be counted in any reasonable analysis. In a garbage collector design, at least one pointer to the data has to be present. In straight a non-GC design, one has to have a pointer to call free, but there's not an iron clad requirement it has to be called. One could theoretically compress a bunch of pointers together into less space as well, but now we're into an analysis of the entropy of the bits in the pointers. Point being you likely need to specify some more constraints to get a really solid answer.
By way of illustration, if one needs arbitrary allocation and deallocation of just int size, one can allocate a large block and create a linked list of indexes using each int to hold the index of the next. Allocation pulls an item off the list and deallocation adds one back. There is a constraint that each allocation is exactly an int. (And that the block is small enough that the maximum index fits in an int.) Multiple sizes can be handled by having different blocks and searching for which block the pointer is in when deallocation happens. Some malloc implementations do something like this for small fixed sizes such as 4, 8, and 16 bytes.
This approach doesn't hit zero overhead as one needs to maintain some data structure to keep track of the blocks. This is illustrated by considering the case of one-byte allocations. A block can at most hold 256 allocations as that is the maximum index that can fit in the block. If we want to allow more allocations than this, we will need at least one pointer per block, which is e.g. 4 or 8 bytes overhead per 256 bytes.
One can also use bitmaps, which amortize to one bit per some granularity plus the quantization of that granularity. Whether this is low overhead or not depends on the specifics. E.g. one bit per byte has no quantization but eats one eighth the allocation size in the free map. Generally this will all require storing the size of the allocation.
In practice allocator design is hard because the trade-off space between size overhead, runtime cost, and fragmentation overhead is complicated, often has large cost differences, and is allocation pattern dependent.

C++ allocates more bytes than asked?

int main(int argc, const char* argv[])
{
for(unsigned int i = 0; i < 10000000; i++)
char *c = new char;
cin.get();
}
In the above code, why does my program use 471MB memory instead of 10MB as one would expect?
Allocation of RAM comes from a combined effort of the runtime library and the operating system. In order to identify the one byte your example code requests, there is some structure which identifies this memory to the runtime. It is sometimes a double linked list, but it's defined by the operating system and runtime implementation.
You can analogize it this way: If you have a linked list container, what you're interested in is simply what you've placed inside each link, but the container must have pointers to the other links in the containers in order to maintain the linked list.
If you use a debugger, or some other debugging tool to track memory, these structures can be even larger, making each allocation more costly.
RAM isn't typically allocated out of an array, but it is possible to overload the new operator to change allocation behavior. It could be possible specifically allocate from an array (a large one in your example) so that allocations behaved as you seem to have expected, and in some applications this is a specific strategy to control memory and improve performance (though the details are usually more complex than that simple illustration).
The allocation not only contains the allocated memory itself, but at least one word telling delete how much memory it has to release; moreover that is a number that has to be correctly aligned, so there will be a certain padding after the allocated char to ensure that the next block is correctly aligned. On a 64 bit machine, that means at least 16 bytes per allocation (8 bytes to hold the size, 1 byte to hold the character, and 7 bytes padding to ensure correct alignment).
However most probably that's not the only data stored; to help the memory allocator to find free memory, additional data is likely stored; if one assumes that data to consist of three pointers, one gets to a total 40 bytes per allocation, which matches your data quite well.
Note also that the allocator will also request a bit more memory from the operating system than needed for the actual allocation, so that it won't need to do an expensive OS call for every little allocation. That is, the run time library allocates larger chunks of memory from the operating system, and then cuts those in smaller pieces for your program's allocations. Thus generally there will be some memory allocated from the operating system (and thus showing up in the task manager), but not yet allocated to a certain object in your program.

Is there a memory overhead associated with heap memory allocations (eg markers in the heap)?

Thinking in particular of C++ on Windows using a recent Visual Studio C++ compiler, I am wondering about the heap implementation:
Assuming that I'm using the release compiler, and I'm not concerned with memory fragmentation/packing issues, is there a memory overhead associated with allocating memory on the heap? If so, roughly how many bytes per allocation might this be?
Would it be larger in 64-bit code than 32-bit?
I don't really know a lot about modern heap implementations, but am wondering whether there are markers written into the heap with each allocation, or whether some kind of table is maintained (like a file allocation table).
On a related point (because I'm primarily thinking about standard-library features like 'map'), does the Microsoft standard-library implementation ever use its own allocator (for things like tree nodes) in order to optimize heap usage?
Yes, absolutely.
Every block of memory allocated will have a constant overhead of a "header", as well as a small variable part (typically at the end). Exactly how much that is depends on the exact C runtime library used. In the past, I've experimentally found it to be around 32-64 bytes per allocation. The variable part is to cope with alignment - each block of memory will be aligned to some nice even 2^n base-address - typically 8 or 16 bytes.
I'm not familiar with how the internal design of std::map or similar works, but I very much doubt they have special optimisations there.
You can quite easily test the overhead by:
char *a, *b;
a = new char;
b = new char;
ptrdiff_t diff = a - b;
cout << "a=" << a << " b=" << b << " diff=" << diff;
[Note to the pedants, which is probably most of the regulars here, the above a-b expression invokes undefined behaviour, since subtracting the address of one piece of allocated and the address of another, is undefined behaviour. This is to cope with machines that don't have linear memory addresses, e.g. segmented memory or "different types of data is stored in locations based on their type". The above should definitely work on any x86-based OS that doesn't use a segmented memory model with multiple data segments in for the heap - which means it works for Windows and Linux in 32- and 64-bit mode for sure].
You may want to run it with varying types - just bear in mind that the diff is in "number of the type, so if you make it int *a, *b will be in "four bytes units". You could make a reinterpret_cast<char*>(a) - reinterpret_cast<char *>(b);
[diff may be negative, and if you run this in a loop (without deleting a and b), you may find sudden jumps where one large section of memory is exhausted, and the runtime library allocated another large block]
Visual C++ embeds control information (links/sizes and possibly some checksums) near the boundaries of allocated buffers. That also helps to catch some buffer overflows during memory allocation and deallocation.
On top of that you should remember that malloc() needs to return pointers suitably aligned for all fundamental types (char, int, long long, double, void*, void(*)()) and that alignment is typically of the size of the largest type, so it could be 8 or even 16 bytes. If you allocate a single byte, 7 to 15 bytes can be lost to alignment only. I'm not sure if operator new has the same behavior, but it may very well be the case.
This should give you an idea. The precise memory waste can only be determined from the documentation (if any) or testing. The language standard does not define it in any terms.
Yes. All practical dynamic memory allocators have a minimal granularity1. For example, if the granularity is 16 bytes and you request only 1 byte, the whole 16 bytes is allocated nonetheless. If you ask for 17 bytes, a block whose size is 32 bytes is allocated etc...
There is also a (related) issue of alignment.2
Quite a few allocators seem to be a combination of a size map and free lists - they split potential allocation sizes to "buckets" and keep a separate free list for each of them. Take a look at Doug Lea's malloc. There are many other allocation techniques with various tradeoffs but that goes beyond the scope here...
1 Typically 8 or 16 bytes. If the allocator uses a free list then it must encode two pointers inside every free slot, so a free slot cannot be smaller than 8 bytes (on 32-bit) or 16 byte (on 16-bit). For example, if allocator tried to split a 8-byte slot to satisfy a 4-byte request, the remaining 4 bytes would not have enough room to encode the free list pointers.
2 For example, if the long long on your platform is 8 bytes, then even if the allocator's internal data structures can handle blocks smaller than that, actually allocating the smaller block might push the next 8-byte allocation to an unaligned memory address.

Extreme memory usage for individual dynamic allocation

here's a simple test I did on MSVC++ 2010 under windows 7:
// A struct with sizeof(s) == 4, e.g 4 bytes
struct s
{
int x;
};
// Allocate 1 million structs
s* test1 = new s[1000000];
// Memory usage show that the increase in memory is roughly 4 bytes * 1000000 - As expected
// NOW! If I run this:
for (int i = 0; i < 1000000; i++)
new s();
// The memory usage is disproportionately large. When divided by 1000000, indicates 64 bytes per s!!!
Is this a common knowledge or am I missing something? Before I always used to create objects on the fly when needed. For example new Triangle() for every triangle in a mesh, etc.
Is there indeed order of magnitude overhead for dynamic memory allocation of individual instances?
Cheers
EDIT:
Just compiled and ran same program at work on Windows XP using g++:
Now the overhead is 16 bytes, not 64 as observed before! Very interesting.
Not necessarily, but the operating system will usually reserve memory on your behalf in whatever sized chunks it finds convenient; on your system, I'd guess it gives you multiples of 64 bytes per request.
There is an overhead associated with keeping track of the memory allocations, after all, and reserving very small amounts isn't worthwhile.
Is that for a debug build? Because in a debug build msvc will allocate "guards" around objects to see if you overwrite past your object boundary.
There is usually overhead with any single memory allocation. Now this is from my knowledge of malloc rather than new but I suspect it's the same.
A section of the memory arena, when carved out for an allocation of (say) 30 bytes, will typically have a header (e.g., 16 bytes, and all figures like that are examples only below, they may be different) and may be padded to a multiple of 16 bytes for easier arena management.
The header is usually important to allow the section to be re-integrated into the free memory pool when you're finished with it.
It contains information about the size of the block at a bare minimum and may have memory guards as well (to detect corruption of the arena).
So, when you allocate your one million structure array, you'll find that it uses an extra 16 bytes for the header (four million and sixteen bytes). When you try to allocate one million individual structures, each and every one of them will have that overhead.
I answered a related question here with more details. I suspect there will be more required header information for C++ since it will probably have to store the number of items over and above the section size (for proper destructor calls) but that's just supposition on my part. It doesn't affect the fact that accounting information of some sort is needed per allocated item.
If you really want to see what the space is being used for, you'll need to dig through the MSVC runtime source code.
You should check the malloc implementation. Probably this will clear things up.
Not sure though if MSVC++'s malloc can be viewed somewhere. If not, look at some other implementation, they are probably similar to some degree.
Don't expect the malloc implementation to be easy. It needs to search for some free space in the allocated virtual pages or allocate a new virtual page. And it must do this fast. As fast as possible. And it must be multithreading safe. Maybe your malloc implementation has some sort of bitvector where it safes which 64 bit chunks are free in some page and it just takes the next free chunk.