Memory with Vectors and Pointers in MPI+OpenMP App? - c++

I have been reading about vectors with respect to memory allocation and trying to work through some hybrid parallelized code that seems to be chewing through memory unexpectedly. The program originally used only OpenMP, so it was limited to processing the data on a per-node basis, but code was recently added to use MPI as well. The data itself is of fixed size for the current problem (15 GB) and is distributed equally among all of the MPI processes.
The program uses vectors almost everywhere, but before the MPI code was added it was able to process much larger amounts of data without running out of memory.
The current data set is only 15 GB, but every attempt to run it is riddled with bad_alloc errors, even when requesting nodes with more than 64 GB of memory.
I have shortened the code down to the critical points and added some inline comments. It is mixed MPI+OpenMP.
Here are my questions:
If I am running out of memory on a per-node basis, does that definitively mean the OpenMP threads are causing the problem?
Could growing these vectors on the fly be causing the memory issue? The vectors are only used for storing data and then for sizing other allocations; could I get away with using a std::list instead?
The vectors are never used after this step, but that memory remains allocated. I could certainly free the memory associated with them, but these vectors were not a problem when the code ran on a single node for ~50 hours over the full data set (not a smaller chunk). Is there anything about the MPI code below that could cause it to request too much memory in some way? Previously, the mallocs for the pointers were not in the code.
svec<int> ind_leftFront;   // "ind_leftFW" in the full code; renamed to match the usage below
svec<int> ind_rightFront;  // likewise "ind_rightFW"
svec<int> ind_leftRC;
svec<int> ind_rightRC;
svec<int> ind_toasted;
// For the examples I will use only leftFront and rightFront, but all of these behave similarly
/* Searching through previously created [outside of function] vectors for things + some math */
#pragma omp critical
{
    // first use of declared vectors
    ind_leftFront.push_back(i);
    ind_rightFront.push_back(c);
}
/* Other push_backs into the other vector structures */
int* all_indexFront_sizes;
int* all_IndexFront_left;
int* all_IndexFront_right;
int* leftFront;
int* rightFront;
int numFront;
numFront = (int)ind_leftFront.size();
leftFront = (int*)malloc(sizeof(int)*numFront);
rightFront = (int*)malloc(sizeof(int)*numFront);
for(int count=0; count<numFront; count++) {
    leftFront[count] = ind_leftFront[count];
    rightFront[count] = ind_rightFront[count];
}
all_indexFront_sizes = (int*)malloc(sizeof(int)*numranks);
MPI_Allgather(&numFront,1,MPI_INT,all_indexFront_sizes,1,MPI_INT,MPI_COMM_WORLD);
int* Front_displs;
Front_displs = (int*)malloc(sizeof(int)*numranks);
Front_displs[0] = 0;
for(int count=1; count<numranks; count++)
    Front_displs[count] = Front_displs[count-1] + all_indexFront_sizes[count-1];
int total_indexFront = Front_displs[numranks-1] + all_indexFront_sizes[numranks-1]; // total element count across all ranks
all_IndexFront_left  = (int*)malloc(sizeof(int)*total_indexFront);
all_IndexFront_right = (int*)malloc(sizeof(int)*total_indexFront);
MPI_Allgatherv(leftFront, numFront,MPI_INT,all_IndexFront_left ,all_indexFront_sizes, Front_displs, MPI_INT,MPI_COMM_WORLD);
MPI_Allgatherv(rightFront,numFront,MPI_INT,all_IndexFront_right,all_indexFront_sizes, Front_displs, MPI_INT,MPI_COMM_WORLD);
free(Front_displs);
free(all_indexFront_sizes);
free(all_IndexFront_left);
free(all_IndexFront_right);
free(leftFront);
free(rightFront);

Related

Reducing memory footprint of c++ program utilising large vectors

In scaling up the problem size I'm handing to a self-coded program, I started to bump into Linux's OOM killer. Both Valgrind (when run on CPU) and cuda-memcheck (when run on GPU) report no memory leaks. The memory usage keeps expanding while iterating through the inner loop, even though I explicitly clear the vectors holding the biggest chunk of data at the end of this loop. How can I ensure this memory hogging disappears?
Checks for memory leaks were performed, and all the memory leaks are fixed. Despite this, out-of-memory errors keep killing the program (via the OOM killer). Manual monitoring of memory consumption shows an increase in memory utilisation, even after explicitly clearing the vectors containing the data.
Key to know is that there are three nested loops: an outer loop over the sub-problems at hand, a middle loop over the Monte Carlo trials, and an inner loop running some sequential process required inside each trial. The pseudo-code looks as follows:
std::vector<object*> sub_problems;
sub_problems.push_back(retrieved_subproblem_from_database);
for(int sub_problem_index = 0; sub_problem_index < sub_problems.size(); ++sub_problem_index){
    std::vector< std::vector<float> > mc_results(100000, std::vector<float>(5, 0.0));
    for(int mc_trial = 0; mc_trial < 100000; ++mc_trial){
        for(int sequential_process_index = 0; sequential_process_index < 5; ++sequential_process_index){
            mc_results[mc_trial][sequential_process_index] = specific_result;
        }
    }
    sub_problems[sub_problem_index]->storeResultsInObject(mc_results);
    // Do some other things
    sub_problems[sub_problem_index]->deleteMCResults();
}
deleteMCResults looks as follows:
bool deleteMCResults() {
    for (int i = 0; i < asset_values.size(); ++i){
        object_mc_results[i].clear();
        object_mc_results[i].shrink_to_fit();
    }
    object_mc_results.clear();
    object_mc_results.shrink_to_fit();
    return true;
}
How can I ensure that memory consumption depends solely on the middle and inner loops instead of the outer loop? The second, third, fourth and subsequent iterations could theoretically use exactly the same memory space/addresses as the first.
Perhaps I'm reading your pseudocode too literally, but it looks like you have two mc_results variables, one declared inside the for loop and one that deleteMCResults is accessing.
In any case, I have two suggestions for how to debug this. First, rather than letting the OOM killer strike, which takes a long time, is unpredictable, and might kill something important, use ulimit -v to put a limit on process size. Set it to something reasonable like, say, 1000000 (about 1GB) and work on keeping your process under that.
Second, start deleting or commenting out everything except the parts of the program that allocate and deallocate memory. Either you will find your culprit or you will make a program small enough to post in its entirety.
deleteMCResults() can be written a lot more simply:
void deleteMCResults() {
    decltype(object_mc_results) empty;
    std::swap(object_mc_results, empty);
}
But in this case, I'm wondering if you really want to release the memory. As you say, the iterations could reuse the same memory, so perhaps you should replace deleteMCResults() with returnMCResultsMemory(). Then hoist the declaration of mc_results out of the loop, and just reset its values to 0.0 after returnMCResultsMemory() returns.
There is one thing that could easily be improved from the code you show. However, it is really not enough and not precise enough info to make a full analysis. Extracting a relevant example ([mcve]) and perhaps asking for a review on codereview.stackexchange.com might improve the outcome.
The simple thing that could be done is to replace the inner vector of five floats with an array of five floats. Each vector consists (in typical implementations) of three pointers: one to the beginning of the allocated memory, one to its end, and another to mark the used amount. The actual storage requires a separate allocation, which in turn incurs some overhead (and also performance overhead when accessing the data, keyword "locality of reference"). These three pointers require 24 octets on a common 64-bit machine. Compare that with five floats, which only require 20 octets. Even if those floats were padded to 24 octets, you would still benefit from eliding the separate allocation.
In order to try this out, just replace the inner vector with a std::array (https://en.cppreference.com/w/cpp/container/array). Odds are that you won't have to change much code, raw arrays, std::array and std::vector have very similar interfaces.

Linux: is /proc/self/statm trustworthy?

My main task is to find out how much memory a process uses to do different things. I read the RSS from the statm file before and after doing something, then subtract the two values to find out how much memory the process used for that something.
For example, in this picture you will see the memory I measured to multiply sparse matrices of different sizes and densities. Notice how odd it is that matrices of size 100x100, 200x200 and 300x300 show no RSS increase at all. For a bunch of other things I am doing, I am also getting odd zeros. Am I missing something here? Am I measuring the memory the wrong way? Please feel free to point out any better way you know to measure memory usage by a piece of code.
I tried using getrusage, which reports the peak usage by the process, and it seems worse.
EDIT: I am coding on C++. I am allocating the matrices outside of main with a function using malloc:
int **createMatrix(int N, int M)
{
    int i, **table;
    table = (int**)malloc(N*sizeof(int *));
    for(i = 0 ; i < N ; i++)
        table[i] = (int*)malloc( M*sizeof(int) );
    return table;
}
massif is fairly good at tracking down memory usage in the general case. I would recommend it in conjunction with massif-visualizer.
Regarding the 0's, keep in mind that this is OS memory. Most likely your code is not allocating directly from the OS, but from an allocator in a standard library (such as the default malloc allocator in the standard C library). These allocate large blocks from the OS and then split them up to fulfill allocation requests. If there was enough free space in an existing block to satisfy the allocation request, no more blocks are requested. Things get more complicated if multiple threads are involved.
More fine-grained tracking than what is in /proc would require you to tell us the programming language and allocation mechanisms used. Most allocators provide these stats somewhere; for example, GNU libc has mallinfo.

(Win7:VC7): TaskMgr suggests "Memory Leak" as Two-Dimensional Static Arrays are Filled, Despite Initialization

I have an MFC application compiled under VS VC7. I have a persistent and worrisome "memory leak" per Task Manager.
I downloaded a professional memory leak detection application, which highlighted some one-off calls to "new" in the VC7 core libraries related to sockets. So, I commented out all of the (one-time) calls from the main program to connect via the sockets. But, the leak persists.
Continuing to comment out whole sections of code, I seem to have isolated the main culprit, and I am very confused.
If I comment out the call to the following function (circular buffer, crude spin lock, called every second in a timer loop), taskmgr suggests there are essentially no more leaks:
void CIBTraderDlg::PushPriceVolume()
{
    INT i;
    InterlockedIncrement(&g_nNowRow);
    if ( g_nNowRow == MAX_HISTORY ) InterlockedExchange(&g_nNowRow, 0);
    for ( i = 0; i < g_Config.m_nFutureCount + g_Config.m_nSpreadCount; i++ )
    {
        pPH[i][g_nNowRow] = g_Config.m_pFutureInfos[i].dBid;
        pVH[i][g_nNowRow] = g_Config.m_pFutureInfos[i].nThisV;
    }
    if ( nHistoryCount < MAX_HISTORY ) InterlockedIncrement(&nHistoryCount);
}
The two arrays pPH and pVH are initialized as follows, global at the top:
DOUBLE pPH[MAX_CONTRACTS][MAX_HISTORY] = {0.00};
INT pVH[MAX_CONTRACTS][MAX_HISTORY] = {0};
Both MAX_CONTRACTS and MAX_HISTORY are known constants.
These are big arrays, as it happens. MAX_HISTORY = 86400 (seconds in a day)
(nb: I've done a lot of math and the arrays are being filled with the correct values)
Based on taskmgr, as the array fills with actual data the size of the program in memory grows, and dramatically so (10K bytes/minute, for an application that is to run 7 x 24). But the array was defined static and of fixed size, and was to be prepopulated with zeroes. My thought had been that I have plenty of RAM, and that by doing it this way the size of the array in memory was fixed. Obviously, I was wrong. I've spent the last two days reading through many articles on "better", more elegant, and more efficient ways to allocate multidimensional arrays, but I haven't read anywhere that what I did would not work.

Memory leakage when creating object in a loop

I am new to C++ and memory management. I have code that builds up a graph composed of objects of type vertex (~100 bytes each) and edge (~50 bytes each). My code works fine when the graph is small, but with the real data, which has ~3M vertexes and ~10M edges, I get the run-time error std::bad_alloc when "new" is used (and not always at the same new).
This, based on what I have gathered, is the effect of a memory leak in my program that makes new memory allocations fail. My question is: what is wrong with the way I am allocating memory, and more importantly, how can I fix it? Here is roughly what I do:
In the graph class constructor I create the array repository for the objects of class vertex:
graph::graph()
{
    // vertexes is a class variable
    vertexes = new vertex *[MAX_AR_LEN]; // where MAX_AR_LEN = 3M
}
I then call a function like this to iteratively build vertex objects and assign them to the array.
void graph::buildVertexes()
{
    for(int i=0; i<v_num; i++)
        vertexes[i] = new vertex(strName);
}
I then complete other tasks, and at the end, before the program exits, a destructor explicitly deletes the graph object:
graph::~graph()
{
    delete[] vertexes;
    vertexes = 0;
}
Where is the leak happening? I am creating a lot of objects, but nothing that, to my knowledge, could be deleted and remains undeleted.
I have been dealing with this for over a week now with not much luck. Thank you very much for your help!
EDIT (after solving the issue):
Thanks all for the help. Looking back, with the info I provided it is hard to pinpoint what was going on. I solved the issues, and here are the very obvious points I took away; so obvious that they might not be worth sharing, but here they are anyway:
When dealing with lots of objects that need to exist in memory simultaneously, use your best estimate of the minimal memory you need before coding. In my case, even without a leak, I would have almost maxed out memory; I just needed better estimates of memory use to figure that out.
As you go along developing your code, frequently running vld.h (or an alternative) can help check that your design is free of memory leaks. Doing this only at the end can be a lot more complicated, and even if you find the leak, it might be harder to fix.
Let’s say you did all this and you expect to have enough memory to run the code, but you get the std::bad_alloc run-time error while there seems to be plenty of free memory available on your system. You might be compiling for a 32-bit platform; switching to 64-bit will allow allocating more of the available memory (for visual studio: ).
Use of vectors instead of arrays, as suggested by many here, is a helpful way to avoid a common route to leaks (and has other conveniences). But suppose you have a memory leak and you have arrays: as arrays are not necessarily the cause of the leak (obviously), switching to vectors might not help you. Looking at array deletion, though, is a good start. Here is what I gathered about how to properly delete an array of pointers to objects:
//Let's say we have an array of pointers to objects
objType **objAr = new objType*[objNum];
for(int i=0; i<objNum; i++)
{
    objAr[i] = new objType();
}
// to delete, first delete each pointed-to object:
for(int i=0; i<objNum; i++)
{
    delete objAr[i];
}
// then delete the array itself (if instead of an array of
// pointers we had an array of objects, the loop above
// wouldn't be needed):
delete [] objAr;
objAr = 0;
Ironically, a source of leakage in my code was improper deletion of a vector of pointers to objects. For vectors, I needed to first delete the elements one by one and then do a vec.clear(); just doing the latter was causing the leak.
Look at how many times you use new: once to allocate the array of pointers (new vertex *[MAX_AR_LEN]) and then v_num more times to allocate each vertex. To avoid memory leaks, you have to use delete the same number of times you use new, so that you deallocate everything you allocated.
You're going to have to loop through your array of pointers and do delete vertexes[i] on each one.
However, if you had used a std::vector<vertex>, you would not have to deal with this manual memory allocation and would avoid these kinds of problems.
Note that the plural of "vertex" is "vertices"

std::sort on container of pointers

I want to explore the performance differences for multiple dereferencing of data inside a vector of new-ly allocated structs (or classes).
struct Foo
{
    int val;
    // some variables
};
std::vector<Foo*> vectorOfFoo;
// Foo objects are new-ed and pushed into vectorOfFoo
for (int i=0; i<N; i++)
{
    Foo *f = new Foo;
    vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector, I would like to enhance locality of reference through the many iterator dereferences; for example, I very often have to perform a double nested loop:
for (vector<Foo*>::iterator iter1 = vectorOfFoo.begin(); iter1 != vectorOfFoo.end(); ++iter1)
{
    int somevalue = (*iter1)->val;
}
Obviously, if the pointers inside vectorOfFoo point far apart in memory, I think locality of reference is somewhat lost.
What about the performance if I sort the vector before iterating over it? Should I see better performance in repeated dereferences?
Am I guaranteed that consecutive calls to `new` allocate objects that are close together in the memory layout?
Just to answer your last question: no, there is no guarantee whatsoever about where new allocates memory. The allocations can be distributed throughout memory. Depending on the current fragmentation of the memory, you may be lucky that they are sometimes close to each other, but no guarantee is - or, actually, can be - given.
If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
But that's pointless without profiling.
It depends on many factors.
First, it depends on how the objects pointed to from the vector were allocated. If they were allocated on different pages, then you cannot help it; you have to fix the allocation part and/or try software prefetching.
You can generally check what virtual addresses malloc gives out, but as a part of the larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
In the case of a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node on which your process is running. Otherwise, no matter what you do, the memory will come from the other node, and there is not much you can do in that case except migrate your program back to its "home" node.
You have to check the stride needed to jump from one object to another. The prefetcher can recognize a stride within a 512-byte window. If the stride is greater, you are talking about random memory access from the prefetcher's point of view. It will then shut off, so as not to evict your data from the cache, and the best you can do is try software prefetching, which may or may not help (always test it).
So if sorting the vector of pointers places the objects they point to contiguously, one after another with a relatively small stride, then yes, you will improve memory access speed by making it friendlier to the prefetch hardware.
You also have to make sure that the cost of sorting the vector doesn't outweigh the gain.
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is tricky business, and things can get worse even though in theory performance should have improved. There are many tools that can help you profile memory access, for example cachegrind; Intel's VTune does the same. So don't guess: experiment and verify the results.