How much is there overhead per single object memory allocation? [closed]

How much is there overhead per single object memory allocation? [closed] - c++

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by system (or std library?) to support memory management infrastructure? I believe there should be some. Otherwise, how the system would know how many bytes to dispose of when I call free(ptr).
UPDATE 1: It may sound like a 'too broad question' and, obviously, a C/C++ library specific, but what I am interested in, is a minimum extra memory needed to support a single allocation. Even not a system or implementation specific. For example, for binary tree, there is a must of 2 pointers - left and right children, and no way you can squeeze it.
UPDATE 2:
I decided to check it for myself on Windows 64.
#include <stdio.h>
#include <conio.h>
#include <windows.h>
#include <psapi.h>
void main(int argc, char *argv[])
{
int m = (argc > 1) ? atoi(argv[1]) : 1;
int n = (argc > 2) ? atoi(argv[2]) : 0;
for (int i = 0; i < n; i++)
malloc(m);
size_t peakKb(0);
PROCESS_MEMORY_COUNTERS pmc;
if ( GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)) )
peakKb = pmc.PeakWorkingSetSize >> 10;
printf("requested : %d kb, total: %d kb\n", (m*n) >> 10, peakKb);
_getch();
}
requested : 0 kb, total: 2080 kb
1 byte:
requested : 976 kb, total: 17788 kb
extra: 17788 - 2080 - 976 = 14732 (+1410%)
2 bytes:
requested : 1953 kb, total: 17784 kb
extra: 17784 - 2080 - 1953 = (+605% over)
4 bytes:
requested : 3906 kb, total: 17796 kb
extra: 17796 - 2080 - 3906 = 10810 (+177%)
8 bytes:
requested : 7812 kb, total: 17784 kb
extra: 17784 - 2080 - 7812 = (0%)
UPDATE 3: THIS IS THE ANSWER TO MY QUESTION I’VE BEEN LOOKING FOR: In addition to being slow, the genericity of the default C++ allocator makes it very space inefficient for small objects. The default allocator manages a pool of memory, and such management often requires some extra memory. Usually, the bookkeeping memory amounts to a few extra bytes (4 to 32) for each block allocated with new. If you allocate 1024-byte blocks, the per-block space overhead is insignificant (0.4% to 3%). If you allocate 8-byte objects, the per-object overhead becomes 50% to 400%, a figure big enough to make you worry if you allocate many such small objects.

For allocated objects, no additional metadata is theoretically required. A conforming implementation of malloc could round up all allocation requests to a fixed maximum object size, for example. So for malloc (25), you would actually receive a 256-byte buffer, and malloc (257) would fail and return a null pointer.
More realistically, some malloc implementations encode the allocation size in the pointer itself, either directly using bit patterns corresponding to specific fixed sized classes, or indirectly using a hash table or a multi-level trie. If I recall correctly, the internal malloc for Address Sanitizer is of this type. For such mallocs, at least part of the immediate allocation overhead does not come from the addition of metadata for heap management, but from rounding the allocation size up to a supported size class.
Other mallocs have a per-allocation header of a single word. (dlmalloc and its derivative are popular examples). The actual per-allocation overhead is usually slightly larger because due to the header word, you get weired supported allocation sizes (such as 24, 40, 56, … bytes with 16-byte alignment on a 64-bit system).
One thing to keep in mind is that many malloc implementations put a lot of data deallocated objects (which have not yet been returned to the operating system kernel), so that malloc (the function) can quickly find an unused memory region of the appropriate size. Particularly for dlmalloc-style allocators, this also provides a constraint on minimum object size. The use of deallocated objects for heap management contributes to malloc overhead, too, but its impact on individual allocations is difficult to quantify.

Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by system (or std library?) to support memory management infrastructure? I believe there should be some. Otherwise, how the system would know how many bytes to dispose of when I call free(ptr).
This is entirely library specific. The answer could be anything from zero to whatever. Your library could add data to the front of the block. Some add data to the front and back of the block to track overwrites. The amount of overhead added varies among libraries.
The length could be tracked within the library itself with a table. In that case, there may not be hidden field added to the allocated memory.
The library might only allocate blocks in fixed sizes. The amount you ask for gets rounded up to the next block size.

The pointer itself is essentially overhead and can be a dominant driver of memory use in some programs.
The theoretical minimum overhead, might be sizeof(void*) for some theoretical system and use, but that combination of CPU, Memory and usage pattern is so unlikely to exist as to be absolutely worthless for consideration. The standard requires that memory returned by malloc be suitably aligned for any data type, therefore there will always be some overhead; in the form of unused memory between the end of one allocated block, and the beginning of the next allocated block, except in the rare cases where all memory usage is sized to a multiple of the block size.
The minimum implementation of malloc/free/realloc, assumes the heap manager has one contiguous block of memory at it's disposal, located somewhere in system memory, the pointer that said heap manager uses to reference that original block, is overhead (again sizeof(void*)). One could imagine a highly contrived application that requested that entire block of memory, thus avoiding the need for additional tracking data. At this point we have 2 * sizeof(void*) worth of overhead, one internal to the heap manager, plus the returned pointer to the one allocated block (the theoretical minimum). Such a conforming heap manager is unlikely to exist as it must also have some means of allocating more than one block from its pool and that implies at a minimum, tracking which blocks within its pool are in use.
One scheme to avoid overhead involves using pointer sizes that are larger than the physical or logical memory available to the application. One can store some information in those unused bits, but they would also count as overhead if their number exceeds a processor word size. Generally, only a hand full of bits are used and those identify which of the memory managers internal pools the memory comes from. The later, implies additonal overhead of pointers to pools. This brings us to real world systems, where the heap manager implementation is tuned to the OS, hardware architecture and typical usage patterns.
Most hosted implementations (hosted == runs on OS) request one or more memory blocks from the operating system in the c-runtime initialization phase. OS memory management calls are expensive in time and space. The OS has it's own memory manager, with its own overhead set, driven by its own design criteria and usage patterns. So c-runtime heap managers attempt to limit the number of calls into the OS memory manager, in order to reduce the latency of the average call to malloc() and free(). Most request the first block from the OS when malloc is first called, but this usually happens at some point in the c-runtime initialization code. This first block is usually a low multiple of the system page size, which can be one or more orders of magnitude larger than the size requested in the initial malloc() call.
At this point it is obvious that heap manager overhead is extremely fluid and difficult to quantify. On a typical modern system, the heap manager must track multiple blocks of memory allocated from the OS, how many bytes are currently allocated to the application in each of those blocks and potentially, how much time has passed since a block went to zero. Then there's the overhead of tracking allocations from within each of those blocks.

Generally malloc rounds up to a minimum alignment boundary and often this is not special cased for small allocations as applications are expected to aggregate many of these into a single allocation. The minimum alignment is often based on the largest required alignment for a load instruction in the architecture the code is running on. So with 128-bit SIMD (e.g. SSE or NEON) the minimum is 16 bytes. In practice there is a header as well which causes the minimum cost in size to be doubled. As SIMD register widths have increased, malloc hasn't increased it's guaranteed alignment.
As was pointed out, the minimum possible overhead is 0. Though the pointer itself should probably be counted in any reasonable analysis. In a garbage collector design, at least one pointer to the data has to be present. In straight a non-GC design, one has to have a pointer to call free, but there's not an iron clad requirement it has to be called. One could theoretically compress a bunch of pointers together into less space as well, but now we're into an analysis of the entropy of the bits in the pointers. Point being you likely need to specify some more constraints to get a really solid answer.
By way of illustration, if one needs arbitrary allocation and deallocation of just int size, one can allocate a large block and create a linked list of indexes using each int to hold the index of the next. Allocation pulls an item off the list and deallocation adds one back. There is a constraint that each allocation is exactly an int. (And that the block is small enough that the maximum index fits in an int.) Multiple sizes can be handled by having different blocks and searching for which block the pointer is in when deallocation happens. Some malloc implementations do something like this for small fixed sizes such as 4, 8, and 16 bytes.
This approach doesn't hit zero overhead as one needs to maintain some data structure to keep track of the blocks. This is illustrated by considering the case of one-byte allocations. A block can at most hold 256 allocations as that is the maximum index that can fit in the block. If we want to allow more allocations than this, we will need at least one pointer per block, which is e.g. 4 or 8 bytes overhead per 256 bytes.
One can also use bitmaps, which amortize to one bit per some granularity plus the quantization of that granularity. Whether this is low overhead or not depends on the specifics. E.g. one bit per byte has no quantization but eats one eighth the allocation size in the free map. Generally this will all require storing the size of the allocation.
In practice allocator design is hard because the trade-off space between size overhead, runtime cost, and fragmentation overhead is complicated, often has large cost differences, and is allocation pattern dependent.

Related

mmap: enforce 64K alignment

I'm porting a project written (by me) for Windows to mobile platforms.
I need an equivalent of VirtualAlloc (+friends), and the natural one is mmap. However there are 2 significant differences.
Addresses returned by VirtualAlloc are guaranteed to be multiples of the so-called allocation granularity (dwAllocationGranularity). Not to be confused with the page size, this number is arbitrary, and on most Windows system is 64K. In contrast address returned by mmap is only guaranteed to be page-aligned.
The reserved/allocated region may be freed at once by a call to VirtualFree, and there's no need to pass the allocation size (that is, the size used in VirtualAlloc). In contrast munmap should be given the exact region size to be unmapped, i.e. it frees the given number of memory pages without any relation about how they were allocated.
This imposes problems for me. While I could live with (2), the (1) is a real problem. I don't want to get into details, but assuming the much smaller allocation granularity, such as 4K, will lead to a serious efficiency degradation. This is related to the fact that my code needs to put some information at every granularity boundary within the allocated regions, which impose "gaps" within the contiguous memory region.
I need to solve this. I can think about pretty naive methods of allocating increased regions, so that they can be 64K-aligned and still have adequate size. Or alternatively reserve huge regions of virtual address space, and then allocating properly-aligned memory regions (i.e. implement a sort of an aligned heap). But I wonder if there are alternatives. Such as special APIs, maybe some flags, secret system calls or whatever.

(1) is actually quite easy to solve. As you note, munmap takes a size parameter, and this is because munmap is capable of partial deallocation. You can thus allocate a chunk of memory that is bigger than what you need, then deallocate the parts that aren't aligned.

If possible, use posix_memalign (which despite the name allocates rather than aligns memory); this allows you to specify an alignment, and the memory allocated can be released using free. You would need to check whether posix_memalign is provided by your target platform.

Address of start block storage

The allocation function attempts to allocate the requested amount of
storage.
If it is successful, it shall return the address of the start of a
block of storage whose length in bytes shall be at least as large as
the requested size.
What does that constraint mean? Could you get an example it violates?
It seems, my question is unclear.
UPD:
Why "at least"? What is the point of allocation more than requested size? Could you get suitable example?

The allowance for allocating "more than required" is there to allow:
Good alignment of the next block of data.
Reduce restrictions on what platforms are able to run code compiled from C and C++.
Flexibility in the design of the memory allocation functionality.
An example of point one is:
char *p1 = new char[1];
int *p2 = new int[1];
If we allocate exactly 1 byte at address 0x1000 for the first allocation, and follow that exactly with a second allocation of 4 bytes for an int, the int will start at address 0x1001. This is "valid" on some architectures, but often leads to a "slower load of the value", on other architectures it will directly lead to a crash, because an int is not accessible on an address that isn't an even multiple of 4. Since the underlying architecture of new doesn't actually know what the memory is eventually going to be used for, it's best to allocate it at "the highest alignment", which in most architectures means 8 or 16 bytes. (If the memory is used for example to store SSE data, it will need an alignment of 16 bytes)
The second case would be where "pointers can only point to whole blocks of 32-bit words". There have been architectures like that in the past. In this case, even if we ignore the above problem with alignment, the memory location specified by a generic pointer is two parts, one for the actual address, and one for the "which byte within that word". In the memory allocator, since typical allocations are much larger than a single byte, we decide to only use the "full word" pointer, so all allocations are by design always rounded up to whole words.
The third case, for example, would be to use a "pre-sized block" allocator. Some real-time OS's for example will have a fixed number of predefined sizes that they allocate - for example 16, 32, 64, 256, 1024, 16384, 65536, 1M, 16M bytes. Allocations are then rounded up to the nearest equal or larger size, so an allocation for 257 bytes would be allocated from the 1024 size. The idea here is to a) provide fast allocation, by keeping track of free blocks in each size, rather than the traditional model of having a large number of blocks in any size to search through to see if there is a big enough block. It also helps against fragmentation (when lots of memory is "free", but the wrong size, so can't be used - for example, if run a loop until the system is out of memory that allocates blocks of 64 bytes, then free every other, and try to allocate a 128 byte block, there is not a single 128 byte block free, because ALL of the memory is carved up into little 64-byte sections).

It means that the allocation function shall return an address of a block of memory whose size is at least the size you requested.
Most of the allocation functions, however, shall return an address of a memory block whose size is bigger than the one you requested and the next allocations will return an address inside this block until it reaches the end of it.
The main reasons for this behavior are:
Minimize the number of new memory block allocations (each block can contain several allocations), which are expensive in terms of time complexity.
Specific alignment issues.

The two most common reasons why allocation
returns a block larger than requested is
Alignment
Bookkeeping

It may be such because in modern operating systems it is much effective to allocate memory page that can be of 512 kb. So internal malloc functionality can just alloc this page of memory, fill its beginning with some service information as to in how many sub-blocks it's divided into, theirs sizes, etc. And only after that it returns to you an address suitable for your needs. Next call to malloc will return another portion of this allocated page for instance. It doesn't matter that the block of memory is restricted to the size you requested. In fact you can heap-overflow this buffer because of no safe mechanism to prevent this sort of activity. Also you can consider alignment questions that other responders stated above. There are enough of types of memory management. You can google it if you are interested enough(Best Fit, First Fit, Last Fit, correct me if I'm wrong).

Why the growth direction of heap address become opposite when allocating larger space?

I was doing some experiments on the heap address growth, and something interesting happened.
(OS: CentOS, )
But I don't understand, why this happened? Thanks!
This is what I did first:
double *ptr[1000];
for (int i=0;i<1000;i++){
ptr[i] = new double[**10000**];
cout << ptr[i] << endl;
}
The output is incremental(for the last few lines):
....
....
0x2481be0
0x2495470
0x24a8d00
0x24bc590
0x24cfe20
0x24e36b0
0x24f6f40
0x250a7d0
0x251e060
Then I changed 10000 to 20000:
double *ptr[1000];
for (int i=0;i<1000;i++){
ptr[i] = new double[**20000**];
cout << ptr[i] << endl;
}
The address became more like the address of stack space(and decremental):
....
....
0x7f69c4d8a010
0x7f69c4d62010
0x7f69c4d3a010
0x7f69c4d12010
0x7f69c4cea010
0x7f69c4cc2010
0x7f69c4c9a010
0x7f69c4c72010
0x7f69c4c4a010
0x7f69c4c22010
0x7f69c4bfa010
0x7f69c4bd2010
0x7f69c4baa010
0x7f69c4b82010

Different environments/implementations allocate memory using different strategies, so there is no one correct rule. However, a common pattern is to use different allocation strategies for small objects vs. large objects.
Often, a runtime will have multiple heaps for objects of different sizes, which are optimized for different usage patterns. For example, small objects tend to be allocated often and deleted quickly, while large objects tend to be created rarely and have a long life.
If you use a single heap for everything, then a few small objects will be quickly peppered throughout your memory space, leaving lots of medium sized blocks available but few or no large blocks needed for larger objects. This is referred to as memory fragmentation, and can cause your allocation to fail even if nominally your app has tons of memory available.
Another reason to use different heaps is to use a different usage tracking method for different object sizes. For example, an implementation might request a new memory block from the OS for large objects, and for small objects, use a few smaller OS memory blocks with sub-allocations handled by the C runtime heap manager. Memory usage tracking mechanisms that are very effective for large objects can be very expensive for smaller ones because the memory used for tracking usage becomes a significant fraction of the actual memory used by each object.
In your case, my guess is that the runtime is allocating small objects at the beginning of the memory space, bottom-up, and larger ones near the end, top-down, to avoid fragmentation.

You're not going to get a great answer here, because the new function can choose any method it wants to allocate memory. My guess would be that the algorithm here broke the pool into small and large allocation pools, and the big allocation pool grows downward so they can meet in the middle (so as to not waste any space).

On UNIX, allocators use sbrk(2) and mmap(2) to get memory from the OS. The addresses returned by sbrk are well defined, but the addresses from mmap are "whatever is available". On Windows, allocators use VirtualAlloc() which is kind of like mmap.

Implementations are free to have hybrids of different allocation schemes. In C++, it's normal for there to be thousands - even millions - of relatively small objects, so it can make sense for the library's memory allocation routines to make sure they pack well and are very lightweight. Your allocations for 10000 doubles do that: they're 80016 bytes apart - 80000 for 10000 8 byte variables and just 16 bytes padding. Node specifically that the size has no relationship to powers of two, whereas when allocation 20000 doubles they're decrementing by 163840 bytes each time... weirdly, exactly 10 * 2^14. That suggests to me that the former allocations are being satisfied from one heap designed to support efficient small-object allocation by the C++ allocation new function, while the latter has crossed a theshold and is probably being sent to malloc for memory coming from a distinct heap, with much more waste.

You were lucky in the sense that the sizes of 10000 doubles and 20000 doubles happen to lie on the opposite sides of a critical thresholds called MMAP_THRESHOLD.
MMAP_THRESHOLD is 128KB by default. So, 80KB (i.e., 10000 doubles) mem alloc requests are serviced over heap, and whereas 160KB (20000 doubles) mem alloc requests are serviced by anonymous memory mapping (through mmap sys call). (Note that using mem mapping for large mem alloc may incur additional penalties due to its different underlying mem alloc handling mechanism. You may want to tune MMAP_THRESHOLD for optimal performance of your apps.)
In Linux Man for malloc:
Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2) are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).

Memory calculation of objects inaccurate?

I'm creating a small cache daemon, and I want to limit its memory usage to approximately a specified amount. However, there seems to be an issue just trying to calculate how much memory is in use.
Every time a CacheEntry object is created, it adds the size of a CacheEntry object (apparently 64 bytes) plus the number of bytes used in internal arrays to the counter for how many bytes are in use. When the CacheEntry object is deleted, it subtracts that amount. I can confirm that the math, at least, is correct.
However, when run inside NetBeans, the memory profiler reports vastly different numbers. Roughly twice as high, to be specific. It is not a memory leak, and it is specifically related to the amount of CacheEntry objects currently in existence. Increasing the amount of data stored in the internal arrays actually brings the numbers closer together (as opposed to further apart, if that were being improperly calculated); from this, I have concluded that the overhead of having a CacheEntry object in memory is almost twice what sizeof() is reporting. It does not rise in steps or "chunks".
Is there some common reason why this might happen?
UPDATE: Just to check, I ran my tests without a profiler in place. Linux reports the same VmHWM/VmRSS either way, so the memory profiler is definitely not affecting the calculations.

Perhaps the profiler is adding reference objects to track the objects? Do you see the same results when you run the application in release vs Debug?

Is there some common reason why this might happen?
Yeah, that could be internal fragmentation and overhead of the memory manager. If your data type is small (eg. sizeof(CacheEntry) is 8 bytes), newing such data type might produce a bigger chunk of memory. It is partly used for malloc's internal bookkeeping (it usually stores the size of the block somewhere), partly for padding needed to align your data type on its natural boundary (eg. 8 bytes data + 4 bytes bookkeeping + 4 bytes padding needed to align the whole thing on 8-byte boundary).
You can solve it by allocating from a single continuous array of CacheEntry (eg. CacheEntry array[1000] takes exactly 1000*sizeof(CacheEntry) bytes). You'd have to track the usage of the individual elements in the array, but that should be doable without additional memory. (eg. by running a free-list of entries in the place of the free entries).

This memory bloat is caused by use of new, specifically on relatively small objects. On Windows, dynamically allocated memory incurs a 16- or 24-byte overhead each time; I haven't found the exact numbers for Linux, but it's roughly the same. This is because each allocated chunk needs to record its location and size (possibly more than once) so that it can be accurately freed later.
As far as I'm aware, the running program also does not know exactly how much overhead is involved in this, at least in any way accessible to the programmer.
Generally speaking, large quantities of small objects should use a memory pool, both for speed and memory conservation.

Extreme memory usage for individual dynamic allocation

here's a simple test I did on MSVC++ 2010 under windows 7:
// A struct with sizeof(s) == 4, e.g 4 bytes
struct s
{
int x;
};
// Allocate 1 million structs
s* test1 = new s[1000000];
// Memory usage show that the increase in memory is roughly 4 bytes * 1000000 - As expected
// NOW! If I run this:
for (int i = 0; i < 1000000; i++)
new s();
// The memory usage is disproportionately large. When divided by 1000000, indicates 64 bytes per s!!!
Is this a common knowledge or am I missing something? Before I always used to create objects on the fly when needed. For example new Triangle() for every triangle in a mesh, etc.
Is there indeed order of magnitude overhead for dynamic memory allocation of individual instances?
Cheers
EDIT:
Just compiled and ran same program at work on Windows XP using g++:
Now the overhead is 16 bytes, not 64 as observed before! Very interesting.

Not necessarily, but the operating system will usually reserve memory on your behalf in whatever sized chunks it finds convenient; on your system, I'd guess it gives you multiples of 64 bytes per request.
There is an overhead associated with keeping track of the memory allocations, after all, and reserving very small amounts isn't worthwhile.

Is that for a debug build? Because in a debug build msvc will allocate "guards" around objects to see if you overwrite past your object boundary.

There is usually overhead with any single memory allocation. Now this is from my knowledge of malloc rather than new but I suspect it's the same.
A section of the memory arena, when carved out for an allocation of (say) 30 bytes, will typically have a header (e.g., 16 bytes, and all figures like that are examples only below, they may be different) and may be padded to a multiple of 16 bytes for easier arena management.
The header is usually important to allow the section to be re-integrated into the free memory pool when you're finished with it.
It contains information about the size of the block at a bare minimum and may have memory guards as well (to detect corruption of the arena).
So, when you allocate your one million structure array, you'll find that it uses an extra 16 bytes for the header (four million and sixteen bytes). When you try to allocate one million individual structures, each and every one of them will have that overhead.
I answered a related question here with more details. I suspect there will be more required header information for C++ since it will probably have to store the number of items over and above the section size (for proper destructor calls) but that's just supposition on my part. It doesn't affect the fact that accounting information of some sort is needed per allocated item.
If you really want to see what the space is being used for, you'll need to dig through the MSVC runtime source code.

You should check the malloc implementation. Probably this will clear things up.
Not sure though if MSVC++'s malloc can be viewed somewhere. If not, look at some other implementation, they are probably similar to some degree.
Don't expect the malloc implementation to be easy. It needs to search for some free space in the allocated virtual pages or allocate a new virtual page. And it must do this fast. As fast as possible. And it must be multithreading safe. Maybe your malloc implementation has some sort of bitvector where it safes which 64 bit chunks are free in some page and it just takes the next free chunk.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js