What is considered a small object in C++? - c++

I've read about Small-Object Allocation in "Modern C++ Design". Andrei Alexandrescu argues that the general purpose operators (new and delete) perform badly for allocating small objects.
In my program there are a lot of objects created and destroyed on the free store. These objects are more than 8000 bytes each.
What size is considered small? Are 8000 bytes small or big when it comes to memory allocation in C++?

The definition of "small" varies, but generally speaking, an object can be considered "small" if its size is smaller than the size overhead caused by heap allocation (or is at least close to that size).
So, a 16 byte object would probably be considered "small" in most cases. A 32 byte object might be considered small under certain specific circumstances. An 8,000 byte object is definitely not "small."
Usually, if you are going to go through the trouble of using a small object allocator, you are looking to improve the performance of some block of code. If using a small object allocator doesn't help your performance, you probably shouldn't use one.
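Before adopting one, it is worth measuring. A rough, minimal sketch (the struct name and iteration count are mine) that times plain new/delete for a 16-byte object, as a baseline to compare any custom allocator against:

#include <chrono>
#include <cstdio>
#include <vector>

struct Small { char bytes[16]; };  // "small" by most definitions

int main() {
    constexpr int N = 1'000'000;
    std::vector<Small*> ptrs(N);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) ptrs[i] = new Small;   // general-purpose operator new
    for (int i = 0; i < N; ++i) delete ptrs[i];
    auto t1 = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("%lld ns per new/delete pair (averaged)\n", (long long)(ns / N));
}

If a candidate small-object allocator doesn't beat that number for your object sizes and allocation pattern, it isn't buying you anything.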

[...] argues that the general purpose operators (new and delete) perform badly for allocating small objects.
Platform dependent. E.g. on Linux I once benchmarked my homegrown AVL tree with static memory management vs. GNU's std::map, which is a red-black tree with fully dynamic memory management. To my surprise, std::map sometimes outran my own highly efficient implementation, and std::map does a hazardous amount of small memory allocations.
In my program there are a lot of objects created and destroyed on the free store.
The memory management concern is a valid one, in the sense that you should always try to reuse existing resources where possible and avoid creating temporary copies.
That's if efficiency/CPU performance is what you are after. If the code runs rarely, then it is pointless to bother.
These objects measure more than 8000 bytes. What size is considered small? Are 8000 bytes small or big when it comes to memory allocation in C++?
This is a pointless question. If your program needs an object taking 8K, then your program needs it. Period.
You should start to worry only if you receive complaints that the software takes too much RAM, or if a profiler points to a performance bottleneck in memory management. Otherwise, modern memory management is relatively fast and robust.
P.S. I personally would consider 8K to be an average memory allocation size: not small, not big. But I'm already used to working with programs which allocate 10+ GB on the heap on a whim. If a data set has to be resident in RAM and is 10 GB in size, well, then the application has little choice but to try to load it.

I certainly wouldn't consider 8000 bytes to be 'small'. 'Small objects' most likely means objects occupying no more than a few hundred bytes; objects amounting to a handful of bytes are going to lead to the biggest problems. However, as KennyTM points out, this is implementation dependent, and some C++ runtimes may well be great at handling small objects.

The question is not how small the object is, but how much memory overhead is incurred by operator new/delete. If your object is bigger than that overhead, then it's not that small.

Related

Memory defragmentation/heap compaction - commonplace in managed languages, but not in C++. Why?

I've been reading up a little on zero-pause garbage collectors for managed languages. From what I understand, one of the most difficult things to do without stop-the-world pauses is heap compaction. Only very few collectors (e.g. Azul C4, ZGC) seem to be doing, or at least approaching, this.
So, most GCs introduce dreaded stop-the-world pauses to compact the heap (bad!). Not doing this seems extremely difficult, and does come with a performance/throughput penalty. So either way, this step seems rather problematic.
And yet - as far as I know, most if not all GCs still do compact the heap occasionally. I've yet to see a modern GC that doesn't do this by default. Which leads me to believe: It has to be really, really important. If it wasn't, surely, the tradeoff wouldn't be worth it.
At the same time, I have never seen anyone do memory defragmentation in C++. I'm sure some people somewhere do, but - correct me if I am wrong - it does not at all seem to be a common concern.
I could of course imagine static memory somewhat lessens this, but surely, most codebases would do a fair amount of dynamic allocations?!
So I'm curious, why is that?
Are my assumptions (very important in managed languages; rarely done in C++) even correct? If yes, is there any explanation I'm missing?
Garbage collection can compact the heap because it knows where all of the pointers are. After all, it just finished tracing them. That means that it can move objects around and adjust the pointers (references) to the new location.
However, C++ cannot do that, because it doesn't know where all the pointers are. If the memory allocation library moved things around, there could be dangling pointers to the old locations.
Oh, and for long-running processes, C++ can indeed suffer from memory fragmentation. This was more of a problem on 32-bit systems, where the process could fail to allocate memory from the OS because it had used up all of the available 1 MB memory blocks. On 64-bit it is almost impossible to create so many memory mappings that there is nowhere to put a new one. However, if you end up with a single 16-byte allocation alive in each 4K memory page, that's a lot of wasted space.
C and C++ applications solve that by using storage pools. For a web server, for example, it would start a pool with a new request. At the end of that web request, everything in the pool gets destroyed. The pool makes a nice, constant sized block of RAM that gets reused over and over without fragmentation.
Garbage collection tends to use recycling pools as well, because it avoids the strain of running a big GC trace and reclaim at the end of a connection.
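For illustration, a minimal bump-pointer pool of the kind described above (a sketch, not production code; it does not run destructors, so it only suits trivially destructible request-local data):

#include <cstddef>
#include <cstdlib>
#include <new>

// A bump-pointer arena: each allocation is O(1), and everything is
// released at once when the arena is reset at the end of a request.
class Arena {
    char*       buf_;
    std::size_t cap_;
    std::size_t used_ = 0;
public:
    explicit Arena(std::size_t cap)
        : buf_(static_cast<char*>(std::malloc(cap))), cap_(cap) {
        if (!buf_) throw std::bad_alloc{};
    }
    ~Arena() { std::free(buf_); }

    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // align the bump offset
        if (p + n > cap_) throw std::bad_alloc{};            // pool exhausted
        used_ = p + n;
        return buf_ + p;
    }
    void reset() { used_ = 0; }  // "destroy" everything allocated for this request
};

A web server along those lines would keep one Arena per request, carve all request-local objects out of it, and call reset() once the response has been sent.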
One method some old operating systems like Apple OS 9 used before virtual memory was a thing is handles. Instead of a memory pointer, allocation returned a handle. That handle was a pointer to the real object in memory. When the operating system needed to compact memory or swap it to disk, it would change the handle.
I have actually implemented a similar system in C++ using an array of handles into a shared memory map pseudo-database. When the map was compacted, the handle table was scanned for affected entries and updated.
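A minimal sketch of that handle/double-indirection idea (names are mine; a real implementation would add locking, free-slot reuse, and lifetime checks):

#include <cstddef>
#include <vector>

// Clients hold an index into the handle table instead of a raw pointer.
// When the underlying storage is compacted, only the table entries are
// rewritten; the handles held by clients stay valid.
struct HandleTable {
    std::vector<void*> slots;

    std::size_t publish(void* p) {                 // hand out a handle for p
        slots.push_back(p);
        return slots.size() - 1;
    }
    void* resolve(std::size_t h) const { return slots[h]; }   // dereference a handle
    void  relocate(std::size_t h, void* p) { slots[h] = p; }  // called by the compactor
};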
Generic memory compaction is not generally useful nor desirable because of its costs.
What may be desirable is to have no wasted/fragmented memory and that can be achieved by other methods than memory compaction.
In C++ one can come up with a different allocation approach for objects that do cause fragmentation in their specific application, e.g. double pointers or double indexes to allow for object relocation, or object pools and arenas that prevent or minimize fragmentation. Such solutions for specific object types are superior to generic garbage collection because they employ application/business-specific knowledge, which allows the scope/cost of object storage maintenance to be minimized and the work to happen at the most appropriate times.
One study found that garbage-collected languages require five times more memory to achieve the performance of equivalent non-GC programs. Memory fragmentation is more severe in GC languages.

Dealing with fragmentation in a memory pool?

Suppose I have a memory pool object with a constructor that takes a pointer to a large chunk of memory ptr and size N. If I do many random allocations and deallocations of various sizes I can get the memory in such a state that I cannot allocate an M byte object contiguously in memory even though there may be a lot free! At the same time, I can't compact the memory because that would cause a dangling pointer on the consumers. How does one resolve fragmentation in this case?
I wanted to add my 2 cents only because no one else pointed out that from your description it sounds like you are implementing a standard heap allocator (i.e. what all of us already use every time we call malloc() or operator new).
A heap is exactly such an object: it goes to the virtual memory manager and asks for a large chunk of memory (what you call "a pool"). Then it has all kinds of different algorithms for allocating and freeing chunks of various sizes in the most efficient way. Furthermore, many people have modified and optimized these algorithms over the years. For a long time Windows came with an option called the low-fragmentation heap (LFH), which you used to have to enable manually; starting with Vista, the LFH is used for all heaps by default.
Heaps are not perfect and they can definitely bog down performance when not used properly. Since OS vendors can't possibly anticipate every scenario in which you will use a heap, their heap managers have to be optimized for the "average" use. But if you have a requirement which is similar to the requirements for a regular heap (i.e. many objects of different sizes...), you should consider just using a heap and not reinventing it, because chances are your implementation will be inferior to what the OS already provides for you.
With memory allocation, the only time you can gain performance by not simply using the heap is by giving up some other aspect (allocation overhead, allocation lifetime....) which is not important to your specific application.
For example, in our application we had a requirement for many allocations of less than 1KB but these allocations were used only for very short periods of time (milliseconds). To optimize the app, I used Boost Pool library but extended it so that my "allocator" actually contained a collection of boost pool objects, each responsible for allocating one specific size from 16 bytes up to 1024 (in steps of 4). This provided almost free (O(1) complexity) allocation/free of these objects but the catch is that a) memory usage is always large and never goes down even if we don't have a single object allocated, b) Boost Pool never frees the memory it uses (at least in the mode we are using it in) so we only use this for objects which don't stick around very long.
So which aspect(s) of normal memory allocation are you willing to give up in your app?
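For illustration, a hedged sketch of the size-class idea described above, assuming Boost.Pool (boost::pool<>) is available; the step size and upper limit here are illustrative, not the exact values from the application described:

#include <boost/pool/pool.hpp>
#include <cstddef>
#include <memory>
#include <vector>

// One boost::pool per size class; allocation picks the smallest class that
// fits.  Memory held by the pools is never returned to the system, which is
// the trade-off mentioned above.
class SizeClassAllocator {
    static constexpr std::size_t kStep = 16, kMax = 1024;
    std::vector<std::unique_ptr<boost::pool<>>> pools_;
public:
    SizeClassAllocator() {
        for (std::size_t sz = kStep; sz <= kMax; sz += kStep)
            pools_.emplace_back(std::make_unique<boost::pool<>>(sz));
    }
    void* allocate(std::size_t n) {
        if (n == 0 || n > kMax) return ::operator new(n);   // fall back to the heap
        return pools_[(n - 1) / kStep]->malloc();
    }
    void deallocate(void* p, std::size_t n) {
        if (n == 0 || n > kMax) { ::operator delete(p); return; }
        pools_[(n - 1) / kStep]->free(p);
    }
};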
Depending on the system there are a couple of ways to do it.
Try to avoid fragmentation in the first place: if you allocate blocks in powers of 2, you have less of a chance of causing this kind of fragmentation. There are a couple of other ways around it, but if you ever reach this state you just OOM at that point, because there are no graceful ways of handling it other than killing the process that asked for memory, blocking until you can allocate memory, or returning NULL as your allocation area.
Another way is to pass pointers to pointers to your data (e.g. int **). Then you can rearrange memory beneath the program (thread-safely, one hopes) and compact the allocations so that you can allocate new blocks and still keep the data from the old blocks (once the system gets to this state, though, that becomes a heavy overhead, but it should seldom be needed).
There are also ways of "binning" memory so that you have contiguous pages: for instance, dedicate one page only to allocations of 512 bytes and less, another to 1024 bytes and less, etc. This makes it easier to decide which bin to use, and in the worst case you split from the next highest bin or merge from a lower bin, which reduces the chance of fragmenting across multiple pages.
Implementing object pools for the objects that you frequently allocate will drive fragmentation down considerably without the need to change your memory allocator.
It would be helpful to know more exactly what you are actually trying to do, because there are many ways to deal with this.
But, the first question is: is this actually happening, or is it a theoretical concern?
One thing to keep in mind is you normally have a lot more virtual memory address space available than physical memory, so even when physical memory is fragmented, there is still plenty of contiguous virtual memory. (Of course, the physical memory is discontiguous underneath but your code doesn't see that.)
I think there is sometimes unwarranted fear of memory fragmentation, and as a result people write a custom memory allocator (or worse, they concoct a scheme with handles and moveable memory and compaction). I think these are rarely needed in practice, and it can sometimes improve performance to throw this out and go back to using malloc.
Write the pool to operate as a list of allocations; it can then be extended and destroyed as needed. This can reduce fragmentation.
And/or implement allocation transfer (or move) support so you can compact active allocations. The object/holder may need to assist you, since the pool may not necessarily know how to transfer the types itself. If the pool is used with a collection type, it is far easier to accomplish compacting/transfers.

questions about memory pool

I need some clarifications for the concept & implementation on memory pool.
By memory pool on wiki, it says that
also called fixed-size-blocks allocation, ... ,
as those implementations suffer from fragmentation because of variable
block sizes, it can be impossible to use them in a real time system
due to performance.
How "variable block size causes fragmentation" happens? How fixed sized allocation can solve this? This wiki description sounds a bit misleading to me. I think fragmentation is not avoided by fixed sized allocation or caused by variable size. In memory pool context, fragmentation is avoided by specific designed memory allocators for specific application, or reduced by restrictly using an intended block of memory.
Also, from several implementation samples, e.g. Code Sample 1 and Code Sample 2, it seems to me that to use a memory pool, the developer has to know the data type very well and then cut, split, or organize the data into linked memory chunks (if the data is close to a linked list) or hierarchically linked chunks (if the data is organized more hierarchically, like files). Besides, it seems the developer has to predict in advance how much memory he needs.
Well, I could imagine this works well for an array of primitive data. What about non-primitive C++ data classes, whose memory layout is not that evident? Even for primitive data, should the developer consider the data type alignment?
Is there a good memory pool library for C and C++?
Thanks for any comments!
Variable block size indeed causes fragmentation. Look at the picture that I am attaching:
The image (from here) shows a situation in which A, B, and C allocate variable-sized chunks of memory.
At some point, B frees all its chunks of memory, and suddenly you have fragmentation. E.g., if C needed to allocate a large chunk of memory that would still fit into the available memory, it could not do so, because the available memory is split into two blocks.
Now, if you think about the case where each chunk of memory would be of the same size, this situation would clearly not arise.
Memory pools, of course, have their own drawbacks, as you yourself point out. So you should not think that a memory pool is a magical wand. It has a cost and it makes sense to pay it under specific circumstances (i.e., embedded system with limited memory, real time constraints and so on).
As to which memory pool is good in C++, I would say that it depends. I have used one under VxWorks that was provided by the OS; in a sense, a good memory pool is effective when it is tightly integrated with the OS. Actually each RTOS offers an implementation of memory pools, I guess.
If you are looking for a generic memory pool implementation, look at this.
EDIT:
From your last comment, it seems to me that possibly you are thinking of memory pools as "the" solution to the problem of fragmentation. Unfortunately, this is not the case. If you want, fragmentation is the manifestation of entropy at the memory level, i.e., it is inevitable. On the other hand, memory pools are a way to manage memory in such a way as to effectively reduce the impact of fragmentation (as I said, and as wikipedia mentioned, mostly on specific systems like real time systems). This comes at a cost, since a memory pool can be less efficient than a "normal" memory allocation technique in that you have a minimum block size. In other words, the entropy reappears in disguise.
Furthermore, there are many parameters that affect the efficiency of a memory pool system, like block size, block allocation policy, or whether you have just one memory pool or several memory pools with different block sizes, different lifetimes, or different policies.
Memory management is really a complex matter and memory pools are just a technique that, like any other, improves things in comparison to other techniques and exact a cost of its own.
In a scenario where you always allocate fixed-size blocks, you either have enough space for one more block, or you don't. If you have, the block fits in the available space, because all free or used spaces are of the same size. Fragmentation is not a problem.
In a scenario with variable-size blocks, you can end up with multiple separate free blocks with varying sizes. A request for a block of a size that is less than the total memory that is free may be impossible to be satisfied, because there isn't one contiguous block big enough for it. For example, imagine you end up with two separate free blocks of 2KB, and need to satisfy a request for 3KB. Neither of these blocks will be enough to provide for that, even though there is enough memory available.
Both fixed-size and variable-size memory pools will feature fragmentation, i.e. there will be some free memory chunks between used ones.
For variable size, this might cause problems, since there might not be a free chunk that is big enough for a certain requested size.
For fixed-size pools, on the other hand, this is not a problem, since only portions of the pre-defined size can be requested. If there is free space, it is guaranteed to be large enough for (a multiple of) one portion.
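To illustrate, a minimal fixed-size block pool along those lines (a sketch; thread safety is ignored and the block size is assumed to be a multiple of the required alignment):

#include <cstddef>
#include <vector>

// All blocks have the same size, so any free block satisfies any request:
// a freed block is simply pushed onto a free list and reused as-is.
class FixedBlockPool {
    std::vector<char>  storage_;
    std::vector<void*> free_list_;
    std::size_t        block_size_;
public:
    FixedBlockPool(std::size_t block_size, std::size_t block_count)
        : storage_(block_size * block_count), block_size_(block_size) {
        for (std::size_t i = 0; i < block_count; ++i)
            free_list_.push_back(storage_.data() + i * block_size_);
    }
    void* allocate() {
        if (free_list_.empty()) return nullptr;   // pool exhausted, not fragmented
        void* p = free_list_.back();
        free_list_.pop_back();
        return p;
    }
    void deallocate(void* p) { free_list_.push_back(p); }  // any freed block fits any future request
};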
If you are building a hard real-time system, you might need to know in advance that you can allocate memory within the maximum time allowed. That can be "solved" with fixed-size memory pools.
I once worked on a military system, where we had to calculate the maximum possible number of memory blocks of each size that the system could ever possibly use. Then those numbers were added to a grand total, and the system was configured with that amount of memory.
Crazily expensive, but worked for the defence.
When you have several fixed size pools, you can get a secondary fragmentation where your pool is out of blocks even though there is plenty of space in some other pool. How do you share that?
With a memory pool, operations might work like this:
Store a global variable that is a list of available objects (initially empty).
To get a new object, try to return one from the global list of available objects. If there isn't one, call operator new to allocate a new object on the heap. Allocation from the list is extremely fast, which is important for applications that might currently be spending a lot of CPU time on memory allocations.
To free an object, simply add it to the global list of available objects. You might place a cap on the number of items allowed in the global list; if the cap is reached then the object would be freed instead of returned to the list. The cap prevents the appearance of a massive memory leak.
Note that this is always done for a single data type of the same size; it doesn't work for differently sized objects, for which you probably need to use the heap as usual.
It's very easy to implement; we use this strategy in our application. It causes a bunch of memory allocations at the beginning of the program, but after that no more freeing and allocating of memory, which would incur significant overhead, takes place.
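A sketch of the scheme just described, for a single type T (the cap value and names are mine; note that reused objects keep their previous state and may need to be reset by the caller):

#include <cstddef>
#include <vector>

template <typename T>
class ObjectPool {
    static constexpr std::size_t kCap = 1024;      // cap on cached objects
    static std::vector<T*>& freeList() {           // the "global list of available objects"
        static std::vector<T*> list;
        return list;
    }
public:
    static T* acquire() {
        auto& list = freeList();
        if (!list.empty()) { T* p = list.back(); list.pop_back(); return p; }
        return new T();                            // list empty: fall back to the heap
    }
    static void release(T* p) {
        auto& list = freeList();
        if (list.size() < kCap) list.push_back(p); // keep it around for reuse
        else delete p;                             // cap reached: really free it
    }
};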

Heap Behavior in C++

Is there anything wrong with the optimization of overloading the global operator new to round up all allocations to the next power of two? Theoretically, this would lower fragmentation at the cost of higher worst-case memory consumption, but does the OS already have redundant behavior with this technique, or does it do its best to conserve memory?
Basically, given that memory usage isn't as much of an issue as performance, should I do this?
The default memory allocator is probably quite smart and will deal well with large numbers of small to medium sized objects, as this is the most common case. For all allocators, the number of bytes requested is rarely the amount actually allocated. For example, if you say:
char * p = new char[3];
the allocator almost certainly does something like:
char * p = new char[16]; // or some minimum power of 2 block size
Unless you can demonstrate that you have an actual problem with allocations, you should not consider writing your own version of new.
You should try implementing it for fun. As soon as it works, throw it away.
Should you do this? No.
Two reasons:
Overloading the global new operator will inevitably cause you pain, especially when external libraries take a dependency on the stock versions.
Modern OS heap implementations already take fragmentation into consideration. If you're on Windows, you can look into the "Low Fragmentation Heap" if you have a special need.
To summarize, don't mess with it unless you can prove (by profiling) that it is a problem to begin with. Don't optimize prematurely.
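For reference, the power-of-two rounding the question describes could be done like this (a sketch only; as noted above, it is rarely worth wiring into a global operator new, and C++20 offers std::bit_ceil for the same computation):

#include <cstddef>

// Round n up to the next power of two (n == 0 maps to 1).
constexpr std::size_t next_pow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

static_assert(next_pow2(3) == 4, "rounds up");
static_assert(next_pow2(16) == 16, "powers of two are unchanged");
static_assert(next_pow2(8000) == 8192, "8000 bytes would become 8 KB");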
I agree with Neil, Alienfluid and Fredoverflow that in most cases you don't want to write your own memory allocator, but I still wrote my own memory allocator about 15 years ago and have refined it over the years (the first version worked by redefining malloc/free; later versions use the global new/delete operators), and in my experience the advantages can be enormous:
Memory leak tracing can be built into your application. No need to run external applications that slow down your application.
If you implement different strategies, you sometimes find difficult problems just by switching to a different memory allocation strategy.
To find difficult memory-related bugs, you can easily add logging to your memory allocator and even further refine it (e.g. log all news and deletes for memory of size N bytes)
You can use page-allocation strategies, where you allocate a complete 4KB page and set the page size so that buffer overflows are caught immediately
You can add logic to delete to print out if memory is freed twice
It's easy to add a red zone to memory allocations (a checksum before the allocated memory and one after it) to find buffer overflows/underflows more quickly (see the sketch after this list)
...
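A minimal sketch of the red-zone idea from the list above (the helper names and guard pattern are mine; a real implementation would also pad the front guard to preserve alignment):

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Debug allocation with "red zones": a known pattern is written before and
// after the user block and verified on free.
static constexpr std::uint32_t kRedZone = 0xDEADBEEF;

void* debug_alloc(std::size_t n) {
    char* raw = static_cast<char*>(std::malloc(n + 2 * sizeof(kRedZone)));
    std::memcpy(raw, &kRedZone, sizeof(kRedZone));                        // front guard
    std::memcpy(raw + sizeof(kRedZone) + n, &kRedZone, sizeof(kRedZone)); // back guard
    return raw + sizeof(kRedZone);
}

void debug_free(void* p, std::size_t n) {
    char* raw = static_cast<char*>(p) - sizeof(kRedZone);
    std::uint32_t front, back;
    std::memcpy(&front, raw, sizeof(front));
    std::memcpy(&back, raw + sizeof(kRedZone) + n, sizeof(back));
    if (front != kRedZone) std::fprintf(stderr, "buffer underflow detected\n");
    if (back  != kRedZone) std::fprintf(stderr, "buffer overflow detected\n");
    std::free(raw);
}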

Faster to malloc multiple small times or few large times?

When using malloc to allocate memory, is it generally quicker to do multiple mallocs of smaller chunks of data or fewer mallocs of larger chunks of data? For example, say you are working with an image file that has black pixels and white pixels. You are iterating through the pixels and want to save the x and y position of each black pixel in a new structure that also has a pointer to the next and previous pixels x and y values. Would it be generally faster to iterate through the pixels allocating a new structure for each black pixel's x and y values with the pointers, or would it be faster to get a count of the number of black pixels by iterating through once, then allocating a large chunk of memory using a structure containing just the x and y values, but no pointers, then iterating through again, saving the x and y values into that array? I'm assuming certain platforms might be different than others as to which is faster, but what does everyone think would generally be faster?
It depends:
Multiple small times means multiple times, which is slower
There may be a special/fast implementation for small allocations.
If I cared, I'd measure it! If I really cared a lot, and couldn't guess, then I might implement both, and measure at run-time on the target machine, and adapt accordingly.
In general I'd assume that fewer is better: but there are size and run-time library implementations such that a (sufficiently) large allocation will be delegated to the (relatively slow) O/S, whereas a (sufficiently) small allocation will be served from a (relatively quick) already-allocated heap.
Allocating large blocks is more efficient; additionally, since you are using larger contiguous blocks, you have greater locality of reference, and traversing your in-memory structure once you've generated it should also be more efficient! Further, allocating large blocks should help to reduce memory fragmentation.
Generally speaking, allocating larger chunks of memory fewer times will be faster. There's overhead involved each time a call to malloc() is made.
Apart from speed issues, there is also the memory fragmentation problem.
Allocating memory is work. The amount of work done when allocating a block of memory is typically independent of the size of the block. You work it out from here.
It's faster not to allocate in performance-sensitive code at all. Allocate the memory you're going to need once in advance, and then use and reuse that as much as you like.
Memory allocation is a relatively slow operation in general, so don't do it more often than necessary.
In general malloc is expensive. It has to find an appropriate memory chunk from which to allocate memory and keep track of non-contiguous memory blocks. In several libraries you will find small memory allocators that try to minimize the impact by allocating a large block and managing the memory in the allocator.
Alexandrescu deals with the problem in 'Modern C++ Design' and in the Loki library if you want to take a look at one such libs.
This question is one of pragmatism, I'm afraid; that is to say, it depends.
If you have a LOT of pixels, only a few of which are black then counting them might be the highest cost.
If you're using C++, which your tags suggest you are, I would strongly suggest using the STL, something like std::vector.
The implementation of vector, if I remember correctly, uses a pragmatic approach to allocation. There are a few heuristics for allocation strategies; an informative one is this:

#include <cstdlib>

class SampleVector {
    int N, used;
    int* data;
public:
    SampleVector() : N(1), used(0),
        data(static_cast<int*>(std::malloc(N * sizeof(int)))) {}
    ~SampleVector() { std::free(data); }

    void push_back(int i)
    {
        if (used >= N)
        {
            // out of capacity: double it and reallocate
            // (error handling omitted for brevity)
            N *= 2;
            data = static_cast<int*>(std::realloc(data, N * sizeof(int)));
        }
        data[used++] = i;
    }
};
In this case, you DOUBLE the amount of memory allocated every time you realloc.
This means that reallocations progressively halve in frequency.
Your STL implementation will have been well-tuned, so if you can use that, do!
Another point to consider is how this interacts with threading. Using malloc many times in a threaded concurrent application is a major drag on performance. In that environment you are better off with a scalable allocator like the one used in Intel's Thread Building Blocks or Hoard. The major limitation with malloc is that there is a single global lock that all the threads contend for. It can be so bad that adding another thread dramatically slows down your application.
As already mentioned, malloc is costly, so fewer calls will probably be faster.
Also, working with the pixels in one contiguous block will, on most platforms, cause fewer cache misses and be faster.
However, there is no guarantee on every platform.
Next to the allocation overhead itself, allocating multiple small chunks may result in lots of cache misses, while if you can iterate through a contiguous block, chances are better.
The scenario you describe asks for preallocation of a large block, imho.
Although allocating large blocks is faster per byte of allocated memory, it will probably not be faster if you artificially increase the allocation size only to chop it up yourself. You're just duplicating the memory management.
Do an iteration over the pixels to count the number of them to be stored.
Then allocate an array for the exact number of items. This is the most efficient solution.
You can use std::vector for easier memory management (see the std::vector::reserve procedure). Note: reserve will probably allocate a little (possibly up to 2 times) more memory than necessary.
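A sketch of the count-then-reserve approach just described (the image representation and names are assumptions of mine):

#include <cstddef>
#include <cstdint>
#include <vector>

struct Pixel { int x, y; };   // position of a black pixel; no next/prev pointers needed

// image: row-major 8-bit grayscale, where 0 means black (an assumption for this sketch)
std::vector<Pixel> collectBlackPixels(const std::vector<std::uint8_t>& image,
                                      int width, int height) {
    std::size_t count = 0;
    for (std::uint8_t v : image)                 // first pass: count black pixels
        if (v == 0) ++count;

    std::vector<Pixel> result;
    result.reserve(count);                       // a single allocation of the exact size
    for (int y = 0; y < height; ++y)             // second pass: store coordinates
        for (int x = 0; x < width; ++x)
            if (image[std::size_t(y) * width + x] == 0)
                result.push_back({x, y});
    return result;
}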
"I can allocate-it-all" (really, I can!)
We can philosophize about some special implementations that speed up small allocations considerably... yes! But in general this holds:
malloc must be general. It must handle all different kinds of allocations. That is the reason it is comparatively slow! It might be that you use a special super-duper library that speeds things up, but even those cannot do wonders, since they have to implement malloc in its full spectrum.
The rule is: the more specialized your allocation code, the faster it is than the broad "I can allocate it all" routine malloc.
So when you are able to allocate the memory in bigger blocks in your code (and it does not cost you too much), you can speed things up considerably. Also, as mentioned by others, you will get a lot less memory fragmentation, which also speeds things up and can cost less memory. You must also see that malloc needs additional memory for every chunk it returns to you (yes, special routines can reduce this... but you don't know what it really does unless you implemented it yourself or bought some wonder library).