How to allocate a large dynamic array in C++? - c++

So I am currently trying to allocate dynamically a large array of elements in C++ (using "new"). Obviously, when "large" becomes too large (>4GB), my program crashes with a "bad_alloc" exception because it can't find such a large chunk of memory available.
I could allocate each element of my array separately and then store the pointers to these elements in a separate array. However, time is critical in my application so I would like to avoid as much cache misses as I can. I could also group some of these elements into blocks but what would be the best size for such a block?
My question is then: what is the best way (timewise) to allocate dynamically a large array of elements such that elements do not have to be stored contiguously but they must be accessible by index (using [])? This array is never going to be resized, no elements is going to be inserted or deleted of it.
I thought I could use std::deque for this purpose, knowing that the elements of an std::deque might or might not be stored contiguously in memory but I read there are concerns about the extra memory this container takes?
Thank you for your help on this!

If your problem is such that you actually run out of memory allocating fairly small blocks (as is done by deque) is not going to help, the overhead of tracking the allocations will only make the situation worse. You need to re-think your implementation such that you can deal with it in blocks that will still fit in memory. For such problems, if using x86 or x64 based hardware I would suggest blocks of at least 2 megabytes (the large page size).

Obviously, when "large" becomes too large (>4GB), my program crashes
with a "bad_alloc" exception because it can't find such a large chunk
of memory available.
You should be using 64-bit CPU and OS at this point, allocating huge contiguous chunk of memory should not be a problem, unless you are actually running out of memory. It is possible that you are building 32-bit program. In this case you won't be able to allocate more than 4 GB. You should build 64-bit application.
If you want something better than plain operator new, then your question is OS-specific. Look at API provided by your OS: on POSIX system you should look for mmap and for VirtualAlloc on Windows.
There are multiple problems with large allocations:
For security reasons OS kernel never gives you pages filled with garbage values, instead all new memory will be zero initialized. This means you don't have to initialize that memory as long as zeroes are exactly what you want.
OS gives you real memory lazily on first access. If you are processing large array, you might waste a lot of time taking page faults. To avoid this you can use MAP_POPULATE on Linux. On Windows you can try PrefetchVirtualMemory (but I am not sure if it can do the job). This should make init allocation slower, but should decrease total time spent in kernel.
Working with large chunks of memory wastes slots in Translation Lookaside Buffer (TLB). Depending on you memory access pattern, this can cause noticeable slowdown. To avoid this you can try using large pages (mmap with MAP_HUGETLB, MAP_HUGE_2MB, MAP_HUGE_1GB on Linux, VirtualAlloc and MEM_LARGE_PAGES). Using large pages is not easy, as they are usually not available by default. They also cannot be swapped out (always "locked in memory"), so using them requires privileges.
If you don't want to use OS-specific functions, the best you can find in C++ is std::calloc. Unlike std::malloc or operator new it returns zero initialized memory so you can probably avoid wasting time initializing that memory. Other than that, there is nothing special about that function. But this is the closest you can get while staying withing standard C++.
There are no standard containers designed to handle large allocations, moreover, all standard container are really really bad at handling those situations.
Some OSes (like Linux) overcommit memory, others (like Windows) do not. Windows might refuse to give you memory if it knows it won't be able to satisfy your request later. To avoid this you might want to increase your page file. Windows needs to reserve that space on disk beforehand, but it does not mean it will use it (start swapping). As actual memory is given to programs lazily, there are might be a lot of memory reserved for applications that will never be actually given to them.
If increasing page file is too inconvenient, you can try creating large file and map it into memory. That file will serve as a "page file" for your memory. See CreateFileMapping and MapViewOfFile.

The answer to this question is extremely application, and platform, dependent. These days if you just need a small integer factor greater than 4GB, you use a 64-bit machine, if possible. Sometimes reducing the size of the element in the array is possible as well. (E.g. using 16-bit fixed-point of half-float instead of 32-bit float.)
Beyond this, you are either looking at sparse arrays or out-of-core techniques. Sparse arrays are used when you are not actually storing elements at all locations in the array. There are many possible implementations and which is best depends on both the distribution of the data and the access pattern of the algorithm. See Eigen for example.
Out-of-core involves explicitly reading and writing parts of the array to/from disk. This used to be fairly common, but people work pretty hard to avoid doing this now. Applications that really require such are often built on top of a database or similar to handle the data management. In scientific computing, one ends up needing to distribute the compute as well as the data storage so there's a lot of complexity around that as well. For important problems the entire design may be driven by having good locality of reference.
Any sparse data structure will have overhead in how much space it takes. This can be fairly low, but it means you have to be careful if you actually have a dense array and are simply looking to avoid memory fragmentation.
If your problem can be broken into smaller pieces that only access part of the array at a time and the main issue is memory fragmentation making it hard to allocate one large block, then breaking the array in to pieces, effectively adding an outer vector of pointers, is a good bet. If you have random access to an array larger than 4 gigabytes and no way to localize the accesses, 64-bit is the way to go.

Depending on what you need the memory for and your speed concerns, and if you're using Linux, you can always try using mmap and simulate a sort of swap. It might be slower, but you can map very large sizes. See Mmap() an entire large file

Related

Is allocating array an expensive operation in fortran?

In a MPI PIC code I am writing, the array size I actually need in storing particles in a processor fluctuates with time, with size changing between [0.5n : 1.5n], where n is an average size.
Presently, I allocate arrays of the largest size, i.e, 1.5*n, in this case, for once in each processor and use them without changing thier size afterward.
I am considering an alternative way: i.e., re-allocating all the arrays each time step with their correct sizes, so that I can save memory. But I worry whether re-allocating arrays is expensive and this overhead will slow the code substantially.
Can this issue be verified only by actually profiling the code, or, there is a simple principle inicating that the allocation operating is cheap enough so that we do not need worry about its overhead?
Someone said:
"ALLOCATE does not imply physical memory allocation. For example, you can ALLOCATE an array up to the size of your virtual memory limit, then use it as a sparse array, using physical memory pages only as the space is addressed."
Is this true in Fortran?
There is no single correct answer to this question. And a complete answer would need to explain how a typical Fortran memory allocator works, AND how typical virtual memory systems work. (That is too broad for a StackOverflow Q&A.)
But here are a couple of salient points.
When you reallocate an array you have the overhead of copying the data in the old array to the new array.
Reallocating an array doesn't necessarily reduce your processes actual memory usage. Memory is requested from the OS in large regions (memory segments) and the Fortran allocator then manages the memory it has been given and responds to the application's allocate and deallocate requests. When an array is deallocated, the memory can't be handed back to the OS because there will most likely be other allocated arrays in the same region.
In fact, repeated allocation and deallocation of variable sized arrays can lead to fragmentation ... which further increases memory usage.
What does this mean for you?
That's not clear. It will depend on exactly what your application's memory usage patterns are. And it will depend on how your Fortran runtime's memory allocator works.
But my gut feeling is that you are probably better off NOT trying to dynamically resize arrays to (just) save memory.
Someone said: "ALLOCATE does not imply physical memory allocation. For example, you can ALLOCATE an array up to the size of your virtual memory limit, then use it as a sparse array, using physical memory pages only as the space is addressed."
That is true, but it is not the complete picture.
You also need to consider what happens when an application's virtual memory usage exceeds the physical memory pages available. In that scenario, when the application tries to access a virtual memory page that is not in physical memory the OS virtual memory system needs to "page" another VM page out of physical RAM and "page" in the VM page that the application wants. This will entail writing the existing page (if it is dirty) to the paging device and then reading in the new one. This is going to take a significant length of time, and it will impact on application performance.
If the ratio of available physical RAM to the application's VM working set is too out of balance, the entire system can go into "virtual memory thrashing" ... which can lead to the machine becoming non-responsive and even crashing.
In short if you don't have enough physical RAM, using virtual memory to implement huge sparse arrays can be disaster prone.
It is worth noting that the compute nodes on a large-scale HPC cluster will often be configured with ZERO backing storage for VM swapping. If an application then attempts to use more RAM than is present on the compute node it will error out. Immediately.
Is this true in Fortran?
Yes. Fortran doesn't have any special magic ...
Fortran is no different than say,C , because Fortran allocate typically does not call any low-level system functions but tends to be implemented using malloc() under the hood.
"Is this true in Fortran?"
The lazy allocation you describe is highly system dependent. It is indeed valid on modern Linux. However, it does not mean that it is a good idea to just allocate several 1 TB arrays and than just using certain sections of them. Even if it works in practice on one computer it may very much fail on a different one or on a different operating system or CPU family.
Re-allocation takes time, but it is the way to go to keep your programs standard conforming and undefined-behaviour free. Reallocating every time step may easily bee too slow. But in your previous answer we have showed you that for continuously growing arrays you typically allocate in a geometric series, e.g. by doubling the size. That means that it will only be re-allocated logarithmically often if it grows linearly.
There may be a concern of exceeding the system memory when allocating to the new size and having two copies at the same size. This is only a concern when your consumption high anyway. C has realloc() (which may not help anyway) but Fortran has nothing similar.
Regarding the title question, not every malloc takes the same time. There are is internal bookkeeping involved and the implementations do differ. Some points are raised at https://softwareengineering.stackexchange.com/questions/319015/how-efficient-is-malloc-and-how-do-implementations-differ and also to some extent at Minimizing the amount of malloc() calls improves performance?

Most efficient way to grow array C++

Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and it's size increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that amortized it's O(1) for the append.
Unless you are running on something very weird the OS memory allocation should take care of most fragmentation issues and the fact that it's hard to find a 800MB free memory block.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
Yet I tend to avoid this solution for the most part at these kinds of input scales, but for simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, when using a debug build, the memory is initialized in its entirety early, and we end up mapping the allocated pages to DRAM immediately on reserving capacity for such a vector. Then we might end up leading to a slight startup delay and a task manager showing 800 megabytes of memory usage for a debug build even before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together by either linking them (forming an unrolled list) or storing a vector of pointers to them in a separate aggregate depending on whether random-access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales which often have times dominated by the upper levels of the memory hierarchy rather than register and instruction level.
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.

Allocating ~10GB of vectors - how can I speed it up?

I'm loading about ~1000 files, each representing an array of ~3 million floats. I need to have them all in memory together as I need to do some calculations that involve all of them.
In the code below, I've broken out the memory allocation and file reading, so I can observe the speed of each separately. I was a bit surprised to find the memory allocation taking much longer than the file reading.
std::vector<std::vector<float> * > v(matrix_count);
for(int i=0; i < matrix_count; i++) {
v[i] = new std::vector<float>(array_size);
}
for(int i=0; i < matrix_count; i++) {
std::ifstream is(files[i]);
is.read((char*) &((*v[i])[0]), size);
is.close();
}
Measuring the time, the allocating loop took 6.8s while file loading took 2.5s. It seems counter-intuitive that reading from the disk is almost 3x faster than just allocating space for it.
Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_malloc -- I guess a 10GB vector isn't ok.
Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_malloc -- I guess a 10GB vector isn't ok.
I mainly wanted to respond by addressing this one part: bad_alloc exceptions tend to be misunderstood. They're not the result of "running out of memory" -- they're the result of the system failing to find a contiguous block of unused pages. You could have plenty more than enough memory available and still get a bad_alloc if you get in the habit of trying to allocate massive blocks of contiguous memory, simply because the system can't find a contiguous set of pages that are free. You can't necessarily avoid bad_alloc by "making sure plenty of memory is free" as you might have already seen where having over 100 gigabytes of RAM can still make you vulnerable to them when trying to allocate a mere 10 GB block. The way to avoid them is to allocate memory in smaller chunks instead of one epic array. At a large enough scale, structures like unrolled lists can start to offer favorable performance over a gigantic array and a much lower (exponentially) probability of ever getting a bad_alloc exception unless you actually do exhaust all the memory available. There is actually a peak where contiguity and the locality of reference it provides ceases to become beneficial and may actually hinder memory performance at a large enough size (mainly due to paging, not caching).
For the kind of epic scale input you're handling, you might actually get better performance out of std::deque given the page-friendly nature of it (it's one of the few times where deque can really shine without need for push_front vs. vector). It's something to potentially try if you don't need perfect contiguity.
Naturally it's best if you measure this with an actual profiler. It'll help us hone in on the actual problem, though it might not be completely shocking (surprising but maybe not shocking) that you might be bottlenecked by memory here instead of disk IO given the kind of "massive number of massive blocks" you're allocating (disk IO is slow but memory heap allocation can sometimes be expensive if you are really stressing the system). It depends a lot on the system's allocation strategy but even slab or buddy allocators can fall back to a much slower code branch if you allocate such epic blocks of memory and en masse, and allocations may even start to require something resembling a search or more access to secondary storage in those extreme cases (here I'm afraid I'm not sure exactly what goes on behind the hood when allocating so many massive blocks, but I have "felt" and measured these kinds of bottlenecks before but in a way where I never quite figured out what the OS was doing exactly -- this above paragraph is purely conjecture).
Here it's kind of counter-intuitive but you can often get better performance allocating a larger number of smaller blocks. Typically that makes things worse, but if we're talking about 3 million floats per memory block and a thousand memory blocks like it, it might help to start allocating in, say, page-friendly 4k chunks. Typically it's cheaper to pre-allocate memory in large blocks in advance and pool it, but "large" in this case is more like 4 kilobyte blocks, not 10 gigabyte blocks.
std::deque will typically do this kind of thing for you so it might be the quickest thing to try out to see if it helps. With std::deque, you should be able to make a single one for all 10 GB worth of contents without splitting it into smaller ones to avoid bad_alloc. It also doesn't have the zero-initialization overhead of the entire contents that some cited, and push_backs to it are constant-time even in the worst-case scenario (not amortized constant time as with std::vector), so I would try std::deque with actually push_back instead of pre-sizing it and using operator[]. You could read the file contents in small chunks at a time (ex: using 4k byte buffers) and just push back the floats. It's something to try anyway.
Anyway, these are all just educated guesses without code and profiling measurements, but these are some things to try out after your measurements.
MMFs may also be the ideal solution for this scenario. Let the OS handle all the tricky details of what it takes to access the file's contents.
Use multiple threads for both memory allocation and reading files. You can create a set of say 15 threads and let each thread pick up the next available job.
When you dig deeper, you will see that opening the file also has a considerable overhead which gets reduced substantially by using multiple threads.
You don't need to handle all the data in memory. Instead of that, you should use something like virtual vector which loads required data when needed. Using that approach saves the memory and don't brings your to side effects of huge memory allocation.

Is it possible to partially free dynamically-allocated memory on a POSIX system?

I have a C++ application where I sometimes require a large buffer of POD types (e.g. an array of 25 billion floats) to be held in memory at once in a contiguous block. This particular memory organization is driven by the fact that the application makes use of some C APIs that operate on the data. Therefore, a different arrangement (such as a list of smaller chunks of memory like std::deque uses) isn't feasible.
The application has an algorithm that is run on the array in a streaming fashion; think something like this:
std::vector<float> buf(<very_large_size>);
for (size_t i = 0; i < buf.size(); ++i) do_algorithm(buf[i]);
This particular algorithm is the conclusion of a pipeline of earlier processing steps that have been applied to the dataset. Therefore, once my algorithm has passed over the i-th element in the array, the application no longer needs it.
In theory, therefore, I could free that memory in order to reduce my application's memory footprint as it chews through the data. However, doing something akin to a realloc() (or a std::vector<T>::shrink_to_fit()) would be inefficient because my application would have to spend its time copying the unconsumed data to the new spot at reallocation time.
My application runs on POSIX-compliant operating systems (e.g. Linux, OS X). Is there any interface by which I could ask the operating system to free only a specified region from the front of the block of memory? This would seem to be the most efficient approach, as I could just notify the memory manager that, for example, the first 2 GB of the memory block can be reclaimed once I'm done with it.
If your entire buffer has to be in memory at once, then you probably will not gain much from freeing it partially later.
The main point on this post is basically to NOT tell you to do what you want to do, because the OS will not unnecessarily keep your application's memory in RAM if it's not actually needed. This is the difference between "resident memory usage" and "virtual memory usage". "Resident" is what is currently used and in RAM, "virtual" is the total memory usage of your application. And as long as your swap partition is large enough, "virtual" memory is pretty much a non-issue. [I'm assuming here that your system will not run out of virtual memory space, which is true in a 64-bit application, as long as you are not using hundreds of terabytes of virtual space!]
If you still want to do that, and want to have some reasonable portability, I would suggest building a "wrapper" that behaves kind of like std::vector and allocates lumps of some megabytes (or perhaps a couple of gigabytes) of memory at a time, and then something like:
for (size_t i = 0; i < buf.size(); ++i) {
do_algorithm(buf[i]);
buf.done(i);
}
The done method will simply check if the value if i is (one element) past the end of the current buffer, and free it. [This should inline nicely, and produce very little overhead on the average loop - assuming elements are actually used in linear order, of course].
I'd be very surprised if this gains you anything, unless do_algorithm(buf[i]) takes quite some time (certainly many seconds, probably many minutes or even hours). And of course, it's only going to help if you actually have something else useful to do with that memory. And even then, the OS will reclaim memory that isn't actively used by swapping it out to disk, if the system is short of memory.
In other words, if you allocate 100GB, fill it, leave it sitting without touching, it will eventually ALL be on the hard-disk rather than in RAM.
Further, it is not at all unusual that the heap in the application retains freed memory, and that the OS does not get the memory back until the application exits - and certainly, if only parts of a larger allocation is freed, the runtime will not release it until the whole block has been freed. So, as stated at the beginning, I'm not sure how much this will actually help your application.
As with everything regarding "tuning" and "performance improvements", you need to measure and compare a benchmark, and see how much it helps.
Is it possible to partially free dynamically-allocated memory on a POSIX system?
You can not do it using malloc()/realloc()/free().
However, you can do it in a semi-portable way using mmap() and munmap(). The key point is that if you munmap() some page, malloc() can later use that page:
create an anonymous mapping using mmap();
subsequently call munmap() for regions that you don't need anymore.
The portability issues are:
POSIX doesn't specify anonymous mappings. Some systems provide MAP_ANONYMOUS or MAP_ANON flag. Other systems provide special device file that can be mapped for this purpose. Linux provides both.
I don't think that POSIX guarantees that when you munmap() a page, malloc() will be able to use it. But I think it'll work an all systems that have mmap()/unmap().
Update
If your memory region is so large that most pages surely will be written to swap, you will not loose anything by using file mappings instead of anonymous mappings. File mappings are specified in POSIX.
If you can do without the convenience of std::vector (which won't give you much in this case anyway because you'll never want to copy / return / move that beast anyway), you can do your own memory handling. Ask the operating system for entire pages of memory (via mmap) and return them as appropriate (using munmap). You can tell mmap via its fist argument and the optional MAP_FIXED flag to map the page at a particular address (which you must ensure to be not otherwise occupied, of course) so you can build up an area of contiguous memory. If you allocate the entire memory upfront, then this is not an issue and you can do it with a single mmap and let the operating system choose a convenient place to map it. In the end, this is what malloc does internally. For platforms that don't have sys/mman.h, it's not difficult to fall back to using malloc if you can live with the fact that on those platforms, you won't return memory early.
I'm suspecting that if your allocation sizes are always multiples of the page size, realloc will be smart enough not to copy any data. You'd have to try this out and see if it works (or consult your malloc's documentation) on your particular target platform, though.
mremap is probably what you need. As long as you're shifting whole pages, you can do a super fast realloc (actually the kernel would do it for you).

Efficient way to truncate a std::vector<char> to length N - to free memory

I have several large std::vectors of chars (bytes loaded from binary files).
When my program runs out of memory, I need to be able to cull some of the memory used by these vectors. These vectors are almost the entirety of my memory usage, and they're just caches for local and network files, so it's safe to just grab the largest one and chop it in half or so.
Only thing is, I'm currently using vector::resize and vector::shrink_to_fit, but this seems to require more memory (I imagine for a reallocation of the new size) and then a bunch of time (for destruction of the now destroyed pointers, which I thought would be free?) and then copying the remaining to the new vector. Note, this is on Windows platform, in debug, so the pointers might not be destroyed in the Release build or on other platforms.
Is there something I can do to just say "C++, please tell the OS that I no longer need the memory located past location N in this vector"?
Alternatively, is there another container I'd be better off using? I do need to have random access though, or put effort into designing a way to easily keep iterators pointing at the place I'll want to read next, which would be possible, just not easy, so I'd prefer not using a std::list.
resize and shrink_to_fit are your best bets, as long as we are talking about standard C++, but these, as you noticed may not help at all if you are in a low memory situation to begin with: since the allocator interface do not provide a realloc-like operation, vector is forced to allocate a new block, copy the data in it and deallocate the old block.
Now, I see essentially four easy ways out:
drop whole vectors, not just parts of them, possibly using an LRU or stuff like that; working with big vectors, the C++ allocator normally just forwards to the OS's memory management calls, so the memory should go right back to the OS;
write your own container which uses malloc/realloc, or OS-specific functionality;
use std::deque instead; you lose guaranteed contiguousness of data, but, since deques normally allocate the space for data in distinct chunks, doing a resize+shrink_to_fit should be quite cheap - simply all the unused blocks at the end are freed, with no need for massive reallocations;
just leave this job to the OS. As already stated in the comments, the OS already has a file cache, and in normal cases it can handle it better than you or me, even just for the fact that it has a better vision of how much physical memory is left for that, what files are "hot" for most applications, and so on. Also, since you are working in a virtual address space you cannot even guarantee that your memory will actually stay in RAM; the very moment that the machine goes in memory pressure and you aren't using some memory pages so often, they get swapped to disk, so all your performance gain is lost (and you wasted space on the paging file for stuff that is already found on disk).
An additional way may be to just use memory mapped files - the system will do its own caching as usual, but you avoid any syscall overhead as long as the file remains in memory.
std::vector::shrink_to_fit() cannot result in more memory being used, if so it's a bug.
C++11 defines shrink_to_fit() as follows:
void shrink_to_fit(); Remarks: shrink_to_fit is a non-binding request to reduce capacity() to size(). [ Note: The request is non-binding to allow latitude for implementation-specific optimizations. — end note ]
As the note indicates, shrink_to_fit() may, but not necessarily, actually free memory, and the standard gives C++ implementations a free hand to recycle and optimize memory usage internally, as they see fit. C++ does not make it mandatory for shrink_to_fit(), and the like, to result in actually memory being released to the operating system, and in many cases the C++ runtime library may not actually be able to, as I'll get to in a moment. The C++ runtime library is allowed to take the freed memory, and stash it away internally, and reuse it automatically for the future memory allocation requests (explicit news, or container growth).
Most modern operating systems are not designed to allocate and release memory blocks of arbitrary sizes. Details differ, but typically an operating system allocates and deallocates memory in even chunks, typically 4Kb, or larger, at even memory page addresses. If you allocate a new object that's only a few hundred bytes long, the C++ library will request an entire page of memory to be allocated, take the first hundred bytes of it for the new object, then keep the spare amount of memory for future new requests.
Similarly, even if shrink_to_fit(), or delete, frees up a few hundred bytes, it can't go back to the operating system immediately, but only when an entire 4kb continuous memory range (or whatever is the allocation page size used by the operating system) -- suitably aligned -- is completely unused at all. Only then can a process release that page back to the operating system. Until then, the library keeps track of freed memory ranges, to be used for future new requests, without asking the operating system to allocate more pages of memory to the process.