Reducing memory footprint of c++ program utilising large vectors - c++

In scaling up the problem size I'm handing to a self-written program, I started to bump into Linux's OOM killer. Neither Valgrind (when run on the CPU) nor cuda-memcheck (when run on the GPU) reports any memory leaks. Memory usage keeps growing while iterating through the inner loop, even though I explicitly clear the vectors holding the biggest chunk of data at the end of this loop. How can I make this memory hogging disappear?
Checks for memory leaks were performed, all the memory leaks are fixed. Despite this, Out of Memory errors keep killing the program (via the OOM Killer). Manual monitoring of memory consumption shows an increase in memory utilisation, even after explicitly clearing the vectors containing the data.
The key point is that there are three nested loops. The outer loop iterates over the sub-problems at hand, the middle loop runs the Monte Carlo trials, and the inner loop runs a sequential process required inside each trial. Pseudo-code looks as follows:
std::vector<object*> sub_problems;
sub_problems.push_back(retrieved_subproblem_from_database);
for(int sub_problem_index = 0; sub_problem_index < sub_problems.size(); ++sub_problem_index){
    std::vector< std::vector<float> > mc_results(100000, std::vector<float>(5, 0.0));
    for(int mc_trial = 0; mc_trial < 100000; ++mc_trial){
        for(int sequential_process_index = 0; sequential_process_index < 5; ++sequential_process_index){
            mc_results[mc_trial][sequential_process_index] = specific_result;
        }
    }
    sub_problems[sub_problem_index]->storeResultsInObject(mc_results);
    // Do some other things
    sub_problems[sub_problem_index]->deleteMCResults();
}
deleteMCResults looks as follows:
bool deleteMCResults() {
    for (int i = 0; i < asset_values.size(); ++i){
        object_mc_results[i].clear();
        object_mc_results[i].shrink_to_fit();
    }
    object_mc_results.clear();
    object_mc_results.shrink_to_fit();
    return true;
}
How can I ensure that memory consumption depends solely on the middle and inner loops rather than on the outer loop? The second, third, fourth and later iterations could in theory use exactly the same memory space/addresses as the first iteration.

Perhaps I'm reading your pseudocode too literally, but it looks like you have two mc_results variables, one declared inside the for loop and one that deleteMCResults is accessing.
In any case, I have two suggestions for how to debug this. First, rather than letting the OOM killer strike, which takes a long time, is unpredictable, and might kill something important, use ulimit -v to put a limit on process size. Set it to something reasonable like, say, 1000000 (about 1GB) and work on keeping your process under that.
Second, start deleting or commenting out everything except the parts of the program that allocate and deallocate memory. Either you will find your culprit or you will make a program small enough to post in its entirety.

deleteMCResults() can be written much more simply.
void deleteMCResults() {
    decltype(object_mc_results) empty;
    std::swap(object_mc_results, empty);
}
But in this case, I'm wondering if you really want to release the memory. As you say, the iterations could reuse the same memory, so perhaps you should replace deleteMCResults() with returnMCResultsMemory(). Then hoist the declaration of mc_results out of the loop, and just reset its values to 0.0 (the initial value in your constructor call) after returnMCResultsMemory() returns.
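A minimal sketch of that reuse idea, assuming the result table has the same dimensions for every sub-problem. The names make_results_buffer and reset_results are hypothetical and not part of the original program; the sketch resets to 0.0f, the initial value used in the question's constructor call.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Results = std::vector<std::vector<float>>;

// Allocate the table once, before the sub-problem loop starts.
Results make_results_buffer(std::size_t trials, std::size_t steps) {
    return Results(trials, std::vector<float>(steps, 0.0f));
}

// Overwrite every value in place. Nothing is freed or allocated here,
// so the next sub-problem writes into the same storage.
void reset_results(Results& results) {
    for (auto& row : results) {
        std::fill(row.begin(), row.end(), 0.0f);
    }
}
```

With this shape, the outer loop would call make_results_buffer(100000, 5) once and reset_results(buffer) at the end of each sub-problem, so peak memory should no longer depend on the number of sub-problems, provided nothing else keeps a copy of the results.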

There is one thing that could easily be improved in the code you show. However, the information is neither complete nor precise enough for a full analysis. Extracting a minimal, complete and verifiable example and perhaps asking for a review on codereview.stackexchange.com might improve the outcome.
The simple thing that could be done is to replace the inner vector of five floats with an array of five floats. Each vector consists (in typical implementations) of three pointers: one to the beginning and one to the end of the allocated memory, and another one to mark the used amount. The actual storage requires a separate allocation, which in turn incurs some memory overhead (and also performance overhead when accessing the data, keyword "locality of reference"). These three pointers require 24 octets on a common 64-bit machine. Compare that with five floats, which only require 20 octets. Even if those floats were padded to 24 octets, you would still benefit from eliding the separate allocation.
In order to try this out, just replace the inner vector with a std::array (https://en.cppreference.com/w/cpp/container/array). Odds are that you won't have to change much code, since raw arrays, std::array and std::vector have very similar interfaces.
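As a rough sketch of that replacement, assuming the inner dimension is fixed at five as in the question (the names kSteps, Row, make_results and payload_bytes are illustrative only):

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kSteps = 5;  // inner dimension taken from the question

using Row = std::array<float, kSteps>;
using Results = std::vector<Row>;  // all rows live in one contiguous allocation

// Every float starts at 0.0f because Row{} value-initialises its elements.
Results make_results(std::size_t trials) {
    return Results(trials, Row{});
}

// Bytes of element storage: one Row per trial, with no per-row heap block
// and no per-row vector bookkeeping.
std::size_t payload_bytes(const Results& results) {
    return results.size() * sizeof(Row);
}
```

Element access stays the same (results[trial][step]), so the loops in the question should compile unchanged apart from the declaration.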

Related

OpenMP first kernel much slower than the second kernel

I have a huge 98306 by 98306 2D array initialized. I created a kernel function that counts the total number of elements below a certain threshold.
#pragma omp parallel for reduction(+:num_below_threshold)
for(row)
    for(col)
        index = get_corresponding_index(row, col);
        if (array[index] < threshold)
            num_below_threshold++;
For benchmarking purposes I measured the execution time of the kernel with the number of threads set to 1. I noticed that the first time the kernel executes it takes around 11 seconds. The next call to the kernel, on the same array and still with one thread, takes only around 3 seconds. I thought it might be a problem related to cache, but it doesn't seem to be. What are the possible reasons for this?
This array is initialized as:
float *array = malloc(sizeof(float) * 98306 * 98306);
for (int i = 0; i < 98306 * 98306; i++) {
    array[i] = rand() % 10;
}
The same kernel is applied to this array twice, and the second execution is much faster than the first. I thought of lazy allocation on Linux, but that shouldn't be a problem because of the initialization loop. Any explanations will be helpful. Thanks!
Since you don't provide any Minimal, Complete and Verifiable Example, I'll have to make some wild guesses here, but I'm pretty confident I have the gist of the issue.
First, you have to notice that 98,306 x 98,306 is 9,664,069,636 which is way larger than the maximum value a signed 32 bit integer can store (which is 2,147,483,647). Therefore, the upper limit of your for initialization loop, after overflowing, could become 1,074,135,044 (as on my machines, although it is undefined behavior so strictly speaking, anything could happen), which is roughly 9 times smaller than what you expected.
So now, after the initialization loop, only 11% of the memory you thought you allocated has actually been allocated and touched by the operating system. However, your first reduction loop does a good job of going over the various elements of the array, and since for about 89% of it, it's for the first time, the OS does the actual memory allocation there and then, which takes a significant amount of time.
And now, for your second reduction loop, all memory has been properly allocated and touched, which makes it much faster.
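The overflow arithmetic above can be checked with a small sketch. It uses unsigned types on purpose, since signed overflow is undefined behaviour and cannot be demonstrated portably; the names kSide, full_count and wrapped_count are made up for this illustration.

```cpp
#include <cstdint>

constexpr std::uint64_t kSide = 98306;

// Element count computed in 64 bits, where the product fits.
constexpr std::uint64_t full_count() {
    return kSide * kSide;
}

// The same product reduced modulo 2^32, i.e. what a wrapping 32-bit
// multiplication would leave behind.
constexpr std::uint32_t wrapped_count() {
    return static_cast<std::uint32_t>(full_count());
}

static_assert(full_count() == 9664069636ULL, "full 64-bit product");
static_assert(wrapped_count() == 1074135044U, "product modulo 2^32");
```

Declaring the loop counter and the bound as std::size_t (or another 64-bit type) in the initialization loop would make it touch all 9,664,069,636 elements.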
So that's what I believe happened. That said, many other parameters can enter into play here, such as:
Swapping: the array you try to allocate represents about 36GB of memory. If your machine doesn't have that much memory available, then your code might swap, which will potentially make a big mess of whatever performance measurement you can come up with
NUMA effect: if your machine has multiple NUMA nodes, then thread pinning and memory affinity, when not managed properly, can have a large impact on performance between loop occurrences
Compiler optimization: you didn't mention which compiler you used and which level of optimization you requested. Depending on that, you'd be amazed at how much the compiler can shorten your code. For example, it could remove the second loop entirely, since it does the same thing as the first and its result is already known. Many other interesting and unexpected transformations can likewise render your benchmark meaningless.

Is every element access in std::vector a cache miss?

It's known that std::vector holds its data on the heap, so the instance of the vector itself and the first element have different addresses. On the other hand, std::array is a lightweight wrapper around a raw array and its address is equal to the first element's address.
Let's assume that the size of the collections is big enough to hold one cache line of int32. On my machine with 384kB L1 cache it's 98304 numbers.
If I iterate over the std::vector, it turns out that I always first access the address of the vector itself and then access the element's address. And these addresses are probably not in the same cache line. So every element access is a cache miss.
But if I iterate over the std::array, the addresses are in the same cache line, so it should be faster.
I tested with VS2013 with full optimization and std::array is approx 20% faster.
Am I right in my assumptions?
Update: in order to not create the second similar topic. In this code I have an array and some local variable:
void test(array<int, 10>& arr)
{
    int m{ 42 };
    for (int i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = i * m;
    }
}
In the loop I'm accessing both an array and a stack variable which are placed far from each other in memory. Does that mean that every iteration I'll access different memory and miss the cache?
Many of the things you've said are correct, but I do not believe that you're seeing cache misses at the rate that you believe you are. Rather, I think you're seeing other effects of compiler optimizations.
You are right that when you look up an element in a std::vector, that there are two memory reads: first, a memory read for the pointer to the elements; second, a memory read for the element itself. However, if you do multiple sequential reads on the std::vector, then chances are that the very first read you do will have a cache miss on the elements, but all successive reads will either be in cache or be unavoidable. Memory caches are optimized for locality of reference, so whenever a single address is pulled into cache a large number of adjacent memory addresses are pulled into the cache as well. As a result, if you iterate over the elements of a std::vector, most of the time you won't have any cache misses at all. The performance should look quite similar to that for a regular array. It's also worth remembering that the cache stores multiple different memory locations, not just one, so the fact that you're reading both something on the stack (the std::vector internal pointer) and something in the heap (the elements), or two different elements on the stack, won't immediately cause a cache miss.
Something to keep in mind is that cache misses are extremely expensive compared to cache hits - often 10x slower - so if you were indeed seeing a cache miss on each element of the std::vector you wouldn't see a gap of only 20% in performance. You'd see something a lot closer to a 2x or greater performance gap.
So why, then, are you seeing a difference in performance? One big factor that you haven't yet accounted for is that if you use a std::array<int, 10>, then the compiler can tell at compile-time that the array has exactly ten elements in it and can unroll or otherwise optimize the loop you have to eliminate unnecessary checks. In fact, the compiler could in principle replace the loop with 10 sequential blocks of code that all write to a specific array element, which might be a lot faster than repeatedly branching backwards in the loop. On the other hand, with equivalent code that uses std::vector, the compiler can't always know in advance how many times the loop will run, so chances are it can't generate code that's as good as the code it generated for the array.
Then there's the fact that the code you've written here is so small that any attempt to time it is going to have a ton of noise. It would be difficult to assess how fast this is reliably, since something as simple as just putting it into a for loop would mess up the cache behavior compared to a "cold" run of the method.
Overall, I wouldn't attribute this to cache misses, since I doubt there's any appreciably different number of them. Rather, I think this is compiler optimization on arrays whose sizes are known statically compared with optimization on std::vectors whose sizes can only be known dynamically.
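To make the static-versus-dynamic size point concrete, here is a small sketch with two hypothetical functions, fill_array and fill_vector, that run the same loop as the question's test function. Whether a given compiler actually unrolls the first one depends on its version and flags, so treat this as an illustration rather than a benchmark.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// N is part of the type, so the trip count is a compile-time constant and
// the optimiser is free to unroll the loop completely.
template <std::size_t N>
void fill_array(std::array<int, N>& arr, int m) {
    for (std::size_t i = 0; i < N; ++i) {
        arr[i] = static_cast<int>(i) * m;
    }
}

// The size is only known at run time, so the loop bound has to be read
// from the vector object unless the compiler can prove it at the call site.
void fill_vector(std::vector<int>& vec, int m) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i] = static_cast<int>(i) * m;
    }
}
```

Both functions produce identical results; comparing the generated assembly for the two is a more reliable way to see the difference than timing such a tiny loop.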
I think it has nothing to do with cache miss.
You can think of std::array as a wrapper around a raw array, i.e. int arr[10], and of vector as a wrapper around a dynamic array, i.e. new int[10]. They should have the same performance. However, when you access a vector, you operate on the dynamic array through pointers. Normally the compiler can optimize code that uses arrays better than code that uses pointers. That might be the reason for your test result, in which std::array is faster.
You can run a test that replaces std::array with int arr[10]. Although std::array is just a wrapper around int arr[10], you might get even better performance (in some cases the compiler can optimize a raw array better). You can also run another test that replaces vector with new int[10]; the two should perform equally.
For your second question, the local variable, i.e. m, will be saved in register (if optimized properly), and there will be no access to the memory location of m during the for loop. So it won't be a problem of cache miss either.

Trying to de allocate a vector with a while loop, task manager showing that memory use is increasing

Okay so I'm trying to learn about vectors, and I made some code to de-allocate the vector:
while (!myVector.empty())
{
    myVector.pop_back();
    myVector.shrink_to_fit();
}
I expected this to work, but in fact, it increases memory use, and my program is stuck in this loop forever. I have realized that the culprit is the shrink_to_fit function, if I remove it from the loop and only call it once the loop is done, it properly de allocates the memory. But why doesn't it work when I put shrink_to_fit in the loop? I have tried both orientations on the loop including this:
while (!myVector.empty())
{
    myVector.shrink_to_fit();
    myVector.pop_back();
}
But that doesn't work either. Also, before anyone says so, I am aware that this isn't the most elegant or efficient way to delete vectors.
Edit: As much as I appreciate the answers, I still have absolutely no clue why this isn't just deleting, shrinking, and repeating. I also have no clue why my loop is looping forever, when it should stop once the vector is empty.
Edit: Full source:
#include <cstdio>   // getchar
#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main()
{
    vector<string> myVector;
    cout << "Begin allocation" << endl;
    getchar();
    while (myVector.size() < 1000000)
    {
        myVector.push_back("Nothing Here");
    }
    cout << "Begin de-allocation" << endl;
    getchar();
    while (!myVector.empty())
    {
        myVector.pop_back();
        myVector.shrink_to_fit();
    }
    cout << myVector.size() << endl;
    getchar();
}
I should probably state my environment because Neil Kirk has said that this shouldn't cause an infinite loop: I'm using Visual Studio 2013 Express without any change to the command line, I am running windows 8.1 and I am using Task Manager to monitor the memory usage. Also, the above source is the FULL source, I have not clipped anything off.
Edit: All right, all right I've received quite a bit of negative attention for producing the world's most inefficient algorithm :P, but nonetheless, those answers have been helpful. Yes I know it's inefficient, even before posting the source, but first of all It was an experiment, second of all adding a whole lot of elements to the vector was the only way for me to detect the fluctuations in task manager. I have now realized that it wasn't, as I originally thought, an infinite loop. It just takes a while to copy a million or so elements.
Every time you call shrink_to_fit your almost 1000000 element vector is reallocated, then each element is moved to the new version, then the old version is deallocated.
You do this at size 1000000, then at 999999, at 999998, at 999997, at 999996, at 999995, and so on.
This results in 500000500000 std::string moves and 1 million allocations during the shrinking portion of your code.
Having requested half-a-trillion operations, it takes a while.
If you don't want the code to do nearly useless things half a trillion times, think about not calling shrink_to_fit unless it would matter. As an example, only call shrink_to_fit if size() is less than 2/3 of capacity().
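A sketch of that 2/3 rule, using a hypothetical helper called drain_with_occasional_shrink. It assumes a standard library whose shrink_to_fit actually reduces capacity; the call is a non-binding request, so an implementation that ignores it would make the helper shrink on every iteration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Pops every element, but only asks for a shrink when less than 2/3 of the
// capacity is in use. Returns how many times shrink_to_fit was called.
std::size_t drain_with_occasional_shrink(std::vector<std::string>& v) {
    std::size_t shrink_calls = 0;
    while (!v.empty()) {
        v.pop_back();
        // size < (2/3) * capacity, written without floating point
        if (v.size() * 3 < v.capacity() * 2) {
            v.shrink_to_fit();  // non-binding request to drop spare capacity
            ++shrink_calls;
        }
    }
    return shrink_calls;
}
```

When each shrink really does cut the capacity down to the current size, the reallocations happen at geometrically decreasing sizes, so the total number of element moves stays proportional to the original size instead of growing quadratically.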
The C++ runtime does not immediately return unused memory to the OS. When you allocate a million blocks and deallocate them again, you end up with pages of memory owned by the process that are not in use. Ideally, what would happen is you'd swap between two different huge buffers, with the previously deallocated buffer being used on the next allocation.
vector<T>::shrink_to_fit() will try to reduce the capacity. In the best case .capacity() is the same as .size() after the call, but that's not guaranteed. In order to reduce the capacity, T's move constructor is going to be used to move the old values into new storage. Instead of the usual O(n), you end up with O(n²).
Note that you should need only about twice the original vector's memory. Depending on your operating system and compiler, deallocations might not release memory to the OS directly. On my PC (Win 8.1, MinGW-G++ 4.8.2 from nuwen), your original vector takes around 55k, and the de-allocation lets the program fluctuate around 59-63k.
There's definitely some deallocation, but getting rid of 1,000,000 elements will take a long time, because allocations are slow as hell, and (in worst case) you're allocating in every step of your while loop.

std::sort on container of pointers

I want to explore the performance differences for multiple dereferencing of data inside a vector of new-ly allocated structs (or classes).
struct Foo
{
    int val;
    // some variables
};
std::vector<Foo*> vectorOfFoo;
// Foo objects are new-ed and pushed in vectorOfFoo
for (int i=0; i<N; i++)
{
    Foo *f = new Foo;
    vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector, I would like to enhance locality of reference across the many iterator dereferences. For example, I very often have to perform a doubly nested loop:
for (vector<Foo*>::iterator iter1 = vectorOfFoo.begin(); iter1!=vectorOfFoo.end(); ++iter1)
{
    int somevalue = (*iter1)->val;
}
Obviously if the pointers inside the vectorOfFoo are very far, I think locality of reference is somewhat lost.
What about the performance if before the loop I sort the vector before iterating on it? Should I have better performance in repeated dereferencings?
Am I guaranteed that consecutive calls to new return pointers that are close to each other in the memory layout?
Just to answer your last question: no, there is no guarantee whatsoever where new allocates memory. The allocations can be distributed throughout the memory. Depending on the current fragmentation of the memory you may be lucky that they are sometimes close to each other but no guarantee is - or, actually, can be - given.
If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
But that's pointless without profiling.
It depends on many factors.
First, it depends on how your objects that are being pointed to from the vector were allocated. If they were allocated on different pages then you cannot help it but fix the allocation part and/or try to use software prefetching.
You can generally check what virtual addresses malloc gives out, but as a part of the larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
On a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node on which your process is running. Otherwise, no matter what you do, the memory will be coming from the other node and you cannot do much in that case except transfer your program back to its "home" node.
You have to check the stride that is needed in order to jump from one object to another. The prefetcher can recognize a stride within a 512-byte window. If the stride is greater, you are talking about random memory access from the prefetcher's point of view. It will then shut off so as not to evict your data from the cache, and the best you can do there is to try software prefetching, which may or may not help (always test it).
So if sorting the vector of pointers makes the objects pointed by them continuously placed one after another with a relatively small stride - then yes, you will improve the memory access speed by making it more friendly for the prefetch hardware.
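Sorting the pointers by address, as discussed above, could look like the following sketch. The helper name sort_by_address is hypothetical, and Foo is reduced to the single member shown in the question. Note that this only changes the visiting order; it does not move the objects any closer together.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct Foo {
    int val = 0;
};

// Orders the pointers by address so that iteration walks memory in
// ascending order. std::less is used because it guarantees a total order
// over pointers, which the built-in < does not for unrelated objects.
void sort_by_address(std::vector<Foo*>& ptrs) {
    std::sort(ptrs.begin(), ptrs.end(), std::less<Foo*>());
}
```

Whether the sorted order helps depends on how far apart the allocator placed the objects, which is why measuring before and after matters here.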
You also have to make sure that the cost of sorting that vector doesn't outweigh the gain.
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
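One way to read "allocate them all at once" is to keep the objects in a single contiguous vector and build the pointer vector on top of it. The FooBlock type below is a hypothetical sketch of that idea, not a general-purpose pool allocator, and it assumes the number of objects is known up front.

```cpp
#include <cstddef>
#include <vector>

struct Foo {
    int val = 0;
};

// Owns all Foo objects in one contiguous block and hands out pointers into
// it. The pointers are valid only while `storage` is neither resized nor
// destroyed, so copying is disabled to avoid dangling pointers.
struct FooBlock {
    std::vector<Foo> storage;
    std::vector<Foo*> pointers;

    explicit FooBlock(std::size_t n) : storage(n) {
        pointers.reserve(n);
        for (Foo& f : storage) {
            pointers.push_back(&f);
        }
    }

    FooBlock(const FooBlock&) = delete;
    FooBlock& operator=(const FooBlock&) = delete;
};
```

Iterating over pointers then touches addresses that are sizeof(Foo) apart, which is the small, regular stride the hardware prefetcher handles well. If nothing else needs the pointers, iterating over storage directly is simpler still.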
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is a tricky business, and things can get worse even though in theory the performance should have improved. There are many tools that can help you profile memory access, for example cachegrind. Intel's VTune does the same, as do many other tools. So don't guess; experiment and verify the results.

Multithreading not taking advantage of multiple cores?

My computer is a dual-core Core 2 Duo. I have implemented multithreading in a slow area of my application, but I still notice that CPU usage never exceeds 50% and it still lags after many iterations. Is this normal? I was hoping it would get my CPU up to 100%, since I'm dividing the work into 4 threads. Why could it still be capped at 50%?
Thanks
See "What am I doing wrong? (multithreading)" for my implementation, except that I fixed the issue that code was having.
Looking at your code, you are making a huge number of allocations in your tight loop--in each iteration you dynamically allocate two, two-element vectors and then push those back onto the result vector (thus making copies of both of those vectors); that last push back will occasionally cause a reallocation and a copy of the vector contents.
Heap allocation is relatively slow, even if your implementation uses a fast, fixed-size allocator for small blocks. In the worst case, the general-purpose allocator may even use a global lock; if so, it will obliterate any gains you might get from multithreading, since each thread will spend a lot of time waiting on heap allocation.
Of course, profiling would tell you whether heap allocation is constraining your performance or whether it's something else. I'd make two concrete suggestions to cut back your heap allocations:
Since every instance of the inner vector has two elements, you should consider using a std::array (or std::tr1::array or boost::array); the array "container" doesn't use heap allocation for its elements (they are stored like a C array).
Since you know roughly how many elements you are going to put into the result vector, you can reserve() sufficient space for those elements before inserting them.
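The two suggestions above can be combined as in this sketch. The element type double, the alias Pair and the function build_results are assumptions made for illustration, since the original code is not shown here.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Pair = std::array<double, 2>;  // element type is an assumption

// Builds n two-element results with at most one heap allocation:
// reserve() sizes the outer vector up front, and each Pair is stored
// inline in that buffer instead of in its own heap block.
std::vector<Pair> build_results(std::size_t n) {
    std::vector<Pair> results;
    results.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double x = static_cast<double>(i);
        results.push_back(Pair{x, x * 2.0});  // no reallocation after reserve
    }
    return results;
}
```

In a multithreaded version, giving each thread its own result vector built this way would keep the threads out of the allocator during the hot loop, which is where the contention described above would otherwise show up.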
From your description we have very little to go on, however, let me see if I can help:
You have implemented a lock-based system but you aren't judiciously using the resources of the second, third, or fourth threads because the entity that they require is constantly locked. (this is a very real and obvious area I'd look into first)
You're not actually using more than a single thread. Somehow, somewhere, those other threads aren't even fired up or initialized. (sounds stupid but I've done this before)
Look into those areas first.