I've found that I often need only a small std::map (say, fewer than 10 keys) or a small std::vector containing only a few elements, and it seems like a real waste of performance to always allocate them dynamically. Structures like std::map<std::string, std::string> and std::vector<std::string> in particular involve a lot of dynamic allocation.
Any good advice? I'd like to at least reduce the amount of dynamic allocation, preferably without sacrificing ease of use. Thanks.
You may use stack-allocated memory for small data (stack allocations are very fast, basically just a stack-pointer movement, although stack space is a precious and very limited resource) and heap-allocated memory for larger data. In other words, think along the lines of std::string's small string optimization.
Moreover, to speed up allocations, you could also preallocate big memory chunks on the heap, and then carve smaller allocations inside those chunks, again basically just increasing a pointer inside the chunk. For a sample implementation of this pool allocator technique, consider reading this blog post.
You will find this CppCon 2016 talk interesting as well:
High Performance Code 201: Hybrid Data Structures
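To make the "carve smaller allocations out of a big chunk" idea concrete, here is a minimal sketch of a bump arena. It is only an illustration, not the implementation from the blog post or the talk: the class name BumpArena and its interface are made up for this example, it never frees individual allocations, and it assumes the requested alignment is a power of two no larger than alignof(std::max_align_t).

```cpp
#include <cstddef>
#include <memory>

// Hypothetical bump arena: one up-front heap allocation, after which each
// request only advances an offset inside that block.
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity)
        : buffer_(new unsigned char[capacity]), capacity_(capacity), used_(0) {}

    // Returns storage for `bytes` bytes aligned to `align` (assumed to be a
    // power of two), or nullptr once the block is exhausted.
    void* allocate(std::size_t bytes,
                   std::size_t align = alignof(std::max_align_t)) {
        // Round the current offset up to the next multiple of `align`.
        const std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        used_ = start + bytes;
        return buffer_.get() + start;
    }

    // Releases everything at once; individual frees are not supported.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::unique_ptr<unsigned char[]> buffer_;
    std::size_t capacity_;
    std::size_t used_;
};
```

A caller would construct one arena per batch of work, call allocate() for each small object, and call reset() when the whole batch is done. The trade-off is that memory comes back only in bulk, which suits short-lived groups of small objects but not long-lived mixed lifetimes.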
Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and its size-increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well.
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that appending to it is amortized O(1).
Unless you are running on something very weird, the OS memory management should take care of most fragmentation issues, including the difficulty of finding a free 800 MB block.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
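A minimal sketch of the reserve-then-push_back pattern follows. The function name record_steps and the halving "residual" are stand-ins invented for this example; the real PDE update would go where the comment indicates.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical solver loop: records one value per time step until a made-up
// convergence test passes or max_steps is reached.
std::vector<double> record_steps(std::size_t max_steps, double tolerance) {
    std::vector<double> history;
    // reserve() sets the capacity only: size stays 0 and no element is
    // constructed, so the reserved storage is not written to here.
    history.reserve(max_steps);

    double residual = 1.0;
    for (std::size_t step = 0; step < max_steps && residual > tolerance; ++step) {
        residual *= 0.5;              // stand-in for the real PDE update
        history.push_back(residual);  // writes only the next unused slot
    }
    return history;
}
```

By contrast, std::vector<double> history(max_steps) or history.resize(max_steps) would value-initialize every element and therefore write to the whole block immediately.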
Yet I tend to avoid this solution for the most part at these kinds of input scales, but for simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, a debug build initializes the memory in its entirety up front, so the allocated pages are mapped to DRAM immediately on reserving capacity for such a vector. That can cause a slight startup delay and a task manager showing 800 megabytes of memory usage for a debug build before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together by either linking them (forming an unrolled list) or storing a vector of pointers to them in a separate aggregate depending on whether random-access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales which often have times dominated by the upper levels of the memory hierarchy rather than register and instruction level.
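As a rough illustration of the random-access variant (a vector of pointers to fixed-size chunks), here is a minimal sketch. The class name ChunkedDoubles and its members are hypothetical, the chunk size of 512 doubles is the 4-kilobyte figure mentioned above, and no room is set aside for chunk metadata.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical chunked array: fixed-size chunks plus a vector of pointers to
// them, so growing never copies elements that are already stored.
class ChunkedDoubles {
public:
    static constexpr std::size_t kChunkSize = 512;  // 512 doubles = 4 KiB

    void push_back(double value) {
        if (size_ % kChunkSize == 0) {
            // The last chunk is full (or none exists yet): allocate another.
            chunks_.push_back(std::unique_ptr<double[]>(new double[kChunkSize]));
        }
        chunks_.back()[size_ % kChunkSize] = value;
        ++size_;
    }

    // Random access costs one extra indirection compared with a flat array.
    double& operator[](std::size_t i) {
        return chunks_[i / kChunkSize][i % kChunkSize];
    }

    std::size_t size() const { return size_; }
    std::size_t chunk_count() const { return chunks_.size(); }

private:
    std::vector<std::unique_ptr<double[]>> chunks_;
    std::size_t size_ = 0;
};
```

Only the small vector of pointers is ever reallocated as the container grows; each 4 KiB chunk stays where it was allocated.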
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.
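For a first rough comparison before reaching for a real profiler, a small timing helper along these lines may be enough. The name time_fill_ms is made up for this sketch, and wall-clock timings from std::chrono are noisy, so repeat the measurement and build with optimizations enabled.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <vector>

// Appends `count` doubles to `c` with push_back and returns the elapsed
// wall-clock time in milliseconds. Works for any container that offers
// push_back, such as std::vector<double> or std::deque<double>.
template <typename Container>
double time_fill_ms(Container& c, std::size_t count) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        c.push_back(static_cast<double>(i));
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To compare the three options, call it once on a vector that had reserve() called with the maximum size, once on a vector reserved with a ballpark size, and once on an empty std::deque<double>, using the same count each time.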
I'm loading about ~1000 files, each representing an array of ~3 million floats. I need to have them all in memory together as I need to do some calculations that involve all of them.
In the code below, I've broken out the memory allocation and file reading, so I can observe the speed of each separately. I was a bit surprised to find the memory allocation taking much longer than the file reading.
// Allocation pass: one heap-allocated vector per file
std::vector<std::vector<float> * > v(matrix_count);
for(int i=0; i < matrix_count; i++) {
    v[i] = new std::vector<float>(array_size);
}
// Reading pass: fill each vector straight from its file
for(int i=0; i < matrix_count; i++) {
    std::ifstream is(files[i]);
    is.read((char*) &((*v[i])[0]), size);
    is.close();
}
Measuring the time, the allocating loop took 6.8s while file loading took 2.5s. It seems counter-intuitive that reading from the disk is almost 3x faster than just allocating space for it.
Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_alloc -- I guess a 10GB vector isn't ok.
Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_alloc -- I guess a 10GB vector isn't ok.
I mainly wanted to respond by addressing this one part: bad_alloc exceptions tend to be misunderstood. They're not necessarily the result of "running out of memory"; they can be the result of the system failing to find a contiguous block of unused pages. You can have more than enough memory available and still get a bad_alloc if you get in the habit of allocating massive contiguous blocks, simply because no contiguous set of free pages is large enough. You can't necessarily avoid bad_alloc by "making sure plenty of memory is free"; as you may have already seen, a machine with over 100 gigabytes of RAM can still be vulnerable when trying to allocate a mere 10 GB block.
The way to avoid them is to allocate memory in smaller chunks instead of one epic array. At a large enough scale, structures like unrolled lists can start to offer favorable performance over a gigantic array, along with a far lower probability of ever getting a bad_alloc unless you actually do exhaust all available memory. There is a point beyond which contiguity and the locality of reference it provides cease to be beneficial and may actually hinder memory performance (mainly due to paging, not caching).
For the kind of epic scale input you're handling, you might actually get better performance out of std::deque given the page-friendly nature of it (it's one of the few times where deque can really shine without need for push_front vs. vector). It's something to potentially try if you don't need perfect contiguity.
Naturally it's best to measure this with an actual profiler. It'll help us home in on the actual problem, though it might not be completely shocking (surprising, but maybe not shocking) that you are bottlenecked by memory rather than disk IO, given the "massive number of massive blocks" you're allocating. Disk IO is slow, but heap allocation can sometimes be expensive if you are really stressing the system. It depends a lot on the system's allocation strategy, but even slab or buddy allocators can fall back to a much slower code path if you allocate such epic blocks en masse, and allocations may even start to require something resembling a search, or more access to secondary storage, in those extreme cases. I'm afraid I'm not sure exactly what goes on under the hood when allocating so many massive blocks; I have "felt" and measured these kinds of bottlenecks before without ever quite figuring out what the OS was doing, so this paragraph is purely conjecture.
Here it's kind of counter-intuitive but you can often get better performance allocating a larger number of smaller blocks. Typically that makes things worse, but if we're talking about 3 million floats per memory block and a thousand memory blocks like it, it might help to start allocating in, say, page-friendly 4k chunks. Typically it's cheaper to pre-allocate memory in large blocks in advance and pool it, but "large" in this case is more like 4 kilobyte blocks, not 10 gigabyte blocks.
std::deque will typically do this kind of thing for you, so it might be the quickest thing to try to see if it helps. With std::deque, you should be able to make a single one for all 10 GB worth of contents, without splitting it into smaller ones, and still avoid bad_alloc. It also doesn't have the zero-initialization overhead for the entire contents that some cited, and push_backs to it are constant-time even in the worst case (not amortized constant time as with std::vector), so I would try std::deque with push_back instead of pre-sizing it and using operator[]. You could read the file contents in small chunks at a time (e.g., using 4 KB buffers) and just push back the floats. It's something to try anyway.
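Reading through a small buffer and appending to a deque could look roughly like the sketch below. The function name append_floats is made up for this example, and it assumes each file holds raw native-endian floats with sizeof(float) == 4, as in the question's is.read call.

```cpp
#include <cstddef>
#include <deque>
#include <fstream>
#include <string>

// Appends the raw floats stored in the file at `path` to `out`, reading
// through a small fixed buffer. Returns the number of floats appended
// (0 if the file cannot be opened).
std::size_t append_floats(const std::string& path, std::deque<float>& out) {
    std::ifstream in(path, std::ios::binary);
    float buffer[1024];  // 1024 floats = 4 KiB when sizeof(float) == 4
    std::size_t total = 0;
    while (in) {
        in.read(reinterpret_cast<char*>(buffer), sizeof buffer);
        // gcount() reports how many bytes the last read actually delivered;
        // a trailing partial float, if any, is ignored.
        const std::size_t got =
            static_cast<std::size_t>(in.gcount()) / sizeof(float);
        out.insert(out.end(), buffer, buffer + got);
        total += got;
    }
    return total;
}
```

Calling it once per file with the same deque accumulates everything into a single container. Whether that actually beats one vector per file is exactly the kind of thing to measure.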
Anyway, these are all just educated guesses without code and profiling measurements, but these are some things to try out after your measurements.
MMFs may also be the ideal solution for this scenario. Let the OS handle all the tricky details of what it takes to access the file's contents.
Use multiple threads for both memory allocation and reading files. You can create a set of say 15 threads and let each thread pick up the next available job.
When you dig deeper, you will see that opening the file also has a considerable overhead which gets reduced substantially by using multiple threads.
You don't need to hold all the data in memory. Instead, you could use something like a virtual vector that loads the required data when needed. That approach saves memory and avoids the side effects of huge memory allocations.
I have a base class named GameObject from which other classes derive.
I am wondering if handling the memory allocation by allocating all derived classes of GameObjects in contiguous memory will improve performance.
I will end up iterating over all of them each game engine frame.
My question is, does contiguous memory storage in this case give me faster iteration times than malloc'ing memory without contiguity? In both cases, I have to keep a vector of pointers to the Game Objects since they will vary in size.
Iterating through objects in contiguous memory likely works better because of caching and locality. However, I recommend that you build the two systems and actually profile them. Good luck!
I'm not sure I understand the question. Are you asking if it's better to pre-allocate all your objects in one giant block of memory and store pointers to subsections of that block? If so, please don't do that.
Instead of being faster, you're more likely to slow things down, since the system has to find larger contiguous blocks instead of smaller non-contiguous ones. Keep in mind block allocation, paging, etc. You may request 100 megabytes of contiguous memory, but it's not really contiguous. A bunch of it may be on disk, and everything is broken up into pages anyway.
Then you're faced with the question of do you allocate all your GameObjects in one go to get contiguous memory or are you creating them on demand? Do you really want to pre-allocate for this one minor optimization? What happens if you need to create a new object and your contiguous memory block wasn't large enough? etc.
Really I'm just brainstorming potential problems here. Like the other comments said, it's a case of premature optimization.
Now, it would certainly be faster if you stored all your pointers in a contiguous array instead of a vector which grows and copies based on its current size, but even then unless you absolutely know the amount of game objects you're better off just allocating a sufficiently large vector so that it only grows once or twice.
I've read about Small-Object Allocation in "Modern C++ Design". Andrei Alexandrescu argues that the general purpose operators (new and delete) perform badly for allocating small objects.
In my program there are a lot of objects created and destroyed on the free store. These objects measure more than 8000 bytes.
What size is considered small? Are 8000 bytes small or big when it comes to memory allocation in C++?
The definition of "small" varies, but generally speaking, an object can be considered "small" if its size is smaller than the size overhead caused by heap allocation (or is at least close to that size).
So, a 16 byte object would probably be considered "small" in most cases. A 32 byte object might be considered small under certain specific circumstances. An 8,000 byte object is definitely not "small."
Usually, if you are going to go through the trouble of using a small object allocator, you are looking to improve the performance of some block of code. If using a small object allocator doesn't help your performance, you probably shouldn't use one.
[...] argues that the general purpose operators (new and delete) perform badly for allocating small objects.
Platform dependent. E.g., on Linux I once benchmarked my homegrown AVL tree with static memory management against GNU's std::map, which is a red-black tree with fully dynamic memory management. To my surprise, std::map sometimes outran my own highly efficient implementation, even though std::map makes an enormous number of small memory allocations.
In my program there are a lot of objects created and destroyed on the free store.
The memory management concern is a valid one, in the sense that you should always try to reuse existing resources where possible and avoid creating temporary copies.
That's if efficiency/CPU performance is what you are after. If the code is run rarely, then it is pointless to bother.
These objects measure more than 8000 bytes. What size is considered small? Are 8000 bytes small or big when it comes to memory allocation in C++?
This is a pointless question. If your program needs an object taking 8K, then your program needs it. Period.
You should start to worry only if you receive complaints that the software takes too much RAM, or if a profiler points to a performance bottleneck in memory management. Otherwise, modern memory management is relatively fast and robust.
P.S. I personally would consider 8K to be an average memory allocation size: not small, not big. But I'm already used to working with programs which allocate 10+ GB on the heap on a whim. If a data set has to be resident in RAM and it is 10 GB in size, well, then the application has little choice but to try to load it.
I certainly wouldn't consider 8000 bytes to be 'small'. Small objects most likely means objects occupying no more than a few hundred bytes - objects amounting to a handful of bytes are going to lead to the biggest problems - however as KennyTM points out this is implementation dependent and some C++ runtimes may well be great at handling small objects.
The question is not how small the object is, but how much memory overhead is incurred by operator new/delete. If your object is larger than that overhead, then it's not that small.
When using malloc to allocate memory, is it generally quicker to do multiple mallocs of smaller chunks of data or fewer mallocs of larger chunks of data? For example, say you are working with an image file that has black pixels and white pixels. You are iterating through the pixels and want to save the x and y position of each black pixel in a new structure that also has a pointer to the next and previous pixels x and y values. Would it be generally faster to iterate through the pixels allocating a new structure for each black pixel's x and y values with the pointers, or would it be faster to get a count of the number of black pixels by iterating through once, then allocating a large chunk of memory using a structure containing just the x and y values, but no pointers, then iterating through again, saving the x and y values into that array? I'm assuming certain platforms might be different than others as to which is faster, but what does everyone think would generally be faster?
It depends:
Multiple small allocations mean multiple calls, which is slower.
There may be a special/fast implementation for small allocations.
If I cared, I'd measure it! If I really cared a lot, and couldn't guess, then I might implement both, and measure at run-time on the target machine, and adapt accordingly.
In general I'd assume that fewer is better, but there are sizes and run-time library implementations for which a (sufficiently) large allocation will be delegated to the (relatively slow) O/S, whereas a (sufficiently) small allocation will be served from a (relatively quick) already-allocated heap.
Allocating large blocks is more efficient; additionally, since you are using larger contiguous blocks, you have greater locality of reference, and traversing your in-memory structure once you've generated it should also be more efficient! Further, allocating large blocks should help to reduce memory fragmentation.
Generally speaking, allocating larger chunks of memory fewer times will be faster. There's overhead involved each time a call to malloc() is made.
Besides the speed issues, there is also the memory fragmentation problem.
Allocating memory is work. The amount of work done when allocating a block of memory is typically independent of the size of the block. You work it out from here.
It's faster not to allocate in performance-sensitive code at all. Allocate the memory you're going to need once in advance, and then use and reuse that as much as you like.
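One common way to follow this advice with standard containers is to let the caller own a scratch buffer and reuse it across calls. The sketch below is only an illustration: the function name count_negative and the "negative samples" task are invented for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical hot-path routine: collects the indices of negative samples
// into `scratch` and returns how many were found. The caller owns `scratch`,
// so its heap block is reused from call to call; clear() resets the size but
// leaves the capacity alone.
std::size_t count_negative(const std::vector<float>& samples,
                           std::vector<std::size_t>& scratch) {
    scratch.clear();
    for (std::size_t i = 0; i < samples.size(); ++i) {
        if (samples[i] < 0.0f) scratch.push_back(i);
    }
    return scratch.size();
}
```

If the caller calls scratch.reserve() once with an upper bound before entering the performance-sensitive loop, later calls will not allocate as long as the results stay within that bound.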
Memory allocation is a relatively slow operation in general, so don't do it more often than necessary.
In general malloc is expensive. It has to find an appropriate memory chunk from which to allocate memory and keep track of non-contiguous memory blocks. In several libraries you will find small memory allocators that try to minimize the impact by allocating a large block and managing the memory in the allocator.
Alexandrescu deals with the problem in 'Modern C++ Design' and in the Loki library, if you want to take a look at one such library.
This question is one of pragmatism, I'm afraid; that is to say, it depends.
If you have a LOT of pixels, only a few of which are black, then counting them might be the highest cost.
If you're using C++, which your tags suggest you are, I would strongly suggest using the STL, something like std::vector.
The implementation of vector, if I remember correctly, uses a pragmatic approach to allocation. There are a few heuristics for allocation strategies, an informative one is this:
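#include <cstdlib>

class SampleVector {
    int N, used, *data;
public:
    // Start with room for one int; malloc takes a size in bytes, not elements.
    SampleVector() : N(1), used(0), data(static_cast<int*>(std::malloc(N * sizeof(int)))) {}
    ~SampleVector() { std::free(data); }
    void push_back(int i)
    {
        if (used >= N)
        {
            // handle reallocation: double the capacity (error checking omitted)
            N *= 2;
            data = static_cast<int*>(std::realloc(data, N * sizeof(int)));
        }
        data[used++] = i;
    }
};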
In this case, you DOUBLE the amount of memory allocated every time you realloc.
This means that reallocations progressively halve in frequency.
Your STL implementation will have been well-tuned, so if you can use that, do!
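To see how quickly the reallocations thin out, the small sketch below simulates the same doubling rule as the sample class above, without touching any memory, and counts how often the capacity has to grow. The function name doubling_reallocations is made up for this example; real std::vector implementations may use a different growth factor.

```cpp
#include <cstddef>

// Counts how many reallocations a capacity-doubling strategy performs for
// `n` appends, starting from capacity 1 as the sample class does.
std::size_t doubling_reallocations(std::size_t n) {
    std::size_t capacity = 1;
    std::size_t used = 0;
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (used >= capacity) {
            capacity *= 2;  // same rule as N *= 2 in the sample class
            ++reallocations;
        }
        ++used;
    }
    return reallocations;
}
```

Under this rule, 1,000 appends trigger 10 reallocations and 1,000,000 appends trigger 20, so the count grows with the logarithm of the number of elements rather than with the number itself.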
Another point to consider is how this interacts with threading. Calling malloc many times in a concurrent multithreaded application can be a major drag on performance. In that environment you are better off with a scalable allocator like the one in Intel's Threading Building Blocks, or Hoard. The major limitation of many malloc implementations is a single global lock that all the threads contend for. It can be so bad that adding another thread dramatically slows down your application.
As already mentioned, malloc is costly, so fewer calls will probably be faster.
Also, working with the pixels in one contiguous block will, on most platforms, cause fewer cache misses and be faster.
However, there is no guarantee of that on every platform.
Next to the allocation overhead itself, allocating multiple small chunks may result in lots of cache misses, while if you can iterate through a contiguous block, chances are better.
The scenario you describe asks for preallocation of a large block, imho.
Although allocating large blocks is faster per byte of allocated memory, it will probably not be faster if you artificially increase the allocation size only to chop it up yourself. You are just duplicating the memory management.
Do an iteration over the pixels to count the number of them to be stored.
Then allocate an array for the exact number of items. This is the most efficient solution.
You can use std::vector for easier memory management (see std::vector::reserve). Note: reserve only guarantees capacity for at least the requested number of elements, so an implementation is allowed to allocate somewhat more than necessary.
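The count-then-allocate approach could be sketched as follows. The image layout is an assumption made for this example (row-major, one byte per pixel, 0 meaning black, image.size() equal to width * height), and the names PixelPos and collect_black_pixels are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct PixelPos {
    int x;
    int y;
};

// Returns the positions of all black pixels, using one counting pass so the
// result is allocated exactly once. Assumes a row-major image with one byte
// per pixel, where 0 means black.
std::vector<PixelPos> collect_black_pixels(const std::vector<std::uint8_t>& image,
                                           int width, int height) {
    // Pass 1: count the black pixels.
    std::size_t count = 0;
    for (std::uint8_t p : image) {
        if (p == 0) ++count;
    }

    // Pass 2: a single allocation, then fill in the positions.
    std::vector<PixelPos> result;
    result.reserve(count);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (image[static_cast<std::size_t>(y) * width + x] == 0) {
                result.push_back(PixelPos{x, y});
            }
        }
    }
    return result;
}
```

Because the positions are stored contiguously in scan order, the next and previous black pixels are simply the neighbouring elements, so no per-pixel pointers are needed.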
"I can allocate-it-all" (really, I can!)
We can philosophize about special implementations that speed up small allocations considerably ... yes! But in general this holds:
malloc must be general. It has to handle every kind of allocation, and that is why it is comparatively slow. You might be using some special super-duper library that speeds things up, but even those cannot work wonders, since they have to implement malloc across its full spectrum.
The rule is: when you write more specialized allocation code, you are always faster than the broad "I can allocate-it-all" routine "malloc".
So when you are able to allocate the memory in bigger blocks in your code (and it does not cost you too much), you can speed things up considerably. Also, as mentioned by others, you will get a lot less memory fragmentation, which also speeds things up and can cost less memory. You must also realize that malloc needs additional memory for every chunk it returns to you (yes, special routines can reduce this ... but you don't know what it really does unless you implemented it yourself or bought some wonder-library).