Running out of RAM in C++

I'm working with a 4D matrix (using nested STL vectors). Usually the dimensions differ. For example, I'm reading a matrix whose dimensions are 192x256x128x96, and the following code pads it with 0s up to the largest dimension (256 in this case).
while(matriz.size() < width) // width is the size of N
{
    vector<vector<vector<short>>> aux;
    matriz.push_back(aux);
}
for(auto i = 0; i < matriz.size(); i++)
{
    while(matriz[i].size() < width)
    {
        vector<vector<short>> aux;
        matriz[i].push_back(aux);
    }
}
for(auto i = 0; i < matriz.size(); i++)
    for(auto j = 0; j < matriz[i].size(); j++)
        while(matriz[i][j].size() < width)
        {
            vector<short> aux;
            matriz[i][j].push_back(aux);
        }
for(auto i = 0; i < matriz.size(); i++)
    for(auto j = 0; j < matriz[i].size(); j++)
        for(auto k = 0; k < matriz[i][j].size(); k++)
            while(matriz[i][j][k].size() < width)
            {
                matriz[i][j][k].push_back(0);
            }
That code works with small-to-medium 4D matrices; I've tried it with 200x200x200x200 and it really works, but I need to use it with a 256x256x256x256 matrix, and when I run it my computer stops responding.
I'm not sure if it is a RAM issue. My computer has 12 GB of RAM, and if I'm not mistaken, the size of the matrix is 8 GB.
Any idea how to fix this?
Edit:
If I let the program keep running, it gets killed some time later.
The memory usage with a 200x200x200x200 matrix is 56.7%.

Let's see if I have this right.
You are producing:
1 vector that holds:
256 vectors that each hold
256 vectors that each hold (65,536 in total)
256 vectors that each hold (16,777,216 in total)
256 shorts (4,294,967,296 in total, or 8,589,934,592 Bytes as you indicated)
Each vector object itself is small (typically three pointers, about 24 bytes, on a 64-bit system), and there are roughly 16.8 million of them, so the bookkeeping adds only a few hundred megabytes on top of the shorts; you're using less than 10 gig of memory.
However, that's a LOT going on. Is it really hanging, or is it just taking a very, very long time?
Some periodic debug output would help answer that.

Some tips (from the comments):
Run an optimized build (-O3); this should speed up processing.
Instead of push_back()-ing an empty vector in a loop, use resize(). This prevents costly reallocations.
So for example, replace
while(matriz.size() < width) // width is the size of N
{
    vector<vector<vector<short>>> aux;
    matriz.push_back(aux);
}
With
matriz.resize(width);
If you do still need to use push_back() in a loop, at least reserve() the capacity beforehand. This can again prevent costly reallocations. Reallocating a vector can briefly double the amount of memory that it would normally use.
Use tools like top to watch memory and swap usage on the machine in real time. If you notice swap space used increasing, that means the machine is running out of memory.
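Putting the resize()/pre-allocation tips together, here is a minimal sketch that builds the whole padded 4D structure in one go, assuming (as in the question) that every dimension should be padded to the same size width:
#include <vector>
using std::vector;

// Zero-initialized width x width x width x width matrix of shorts.
// The nested fill constructor allocates everything up front, so no
// reallocation happens while the real data is copied in afterwards.
vector<vector<vector<vector<short>>>> matriz(
    width,
    vector<vector<vector<short>>>(
        width,
        vector<vector<short>>(
            width,
            vector<short>(width, 0))));
Note that for width = 256 this still needs about 8 GB for the shorts plus a few hundred megabytes of vector bookkeeping, so on a 12 GB machine a flat one-dimensional buffer with manual indexing (as discussed in the related questions below) remains the more robust option.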

Related

allocating 3 dimensional vectors of size 10000 10000 3 in c++

I'm making a picture editing program, and I'm stuck in allocating memory.
I have no idea what is going on.
Ok.. So when I do this:
std::vector<unsigned char> h;
for (int a = 0; a < 10000 * 10000 * 3; a++) {
    h.push_back(0);
}
this is fine (sorry, I had to), but when I do this:
std::vector<std::vector<std::vector<unsigned char>>> h;
for (uint32_t a = 0; a < 10000; a++) {
    h.push_back({});
    for (uint32_t b = 0; b < 10000; b++) {
        h.at(a).push_back({});
        for (uint32_t c = 0; c < 3; c++) {
            h.at(a).at(b).push_back(0xff);
        }
    }
}
my memory usage explodes, and I get error: Microsoft C++ exception: std::bad_alloc at memory location 0x009CF51C
I'm working with .bmp.
Currently, code is in testing mode so it's basically a giant mess...
I'm 15, so don't expect much of me.
I was searching for solutions, but all I found was like how to handle large integers and so on...
Feel free to suggest another solution, but I want my code to be as beginner-friendly as it can get.
This is due to the overhead of the inner vector<unsigned char>. Each such object with 3 elements takes not 3 bytes but probably 4 (due to the reallocation policy of push_back), plus 3 pointers for the vector object itself, which probably take 3*8 = 24 bytes. Overall your structure takes about 9.3 times the memory it could have.
If you replace the inner vector with an array, it will start working, since array does not have this overhead:
std::vector<std::vector<std::array<unsigned char, 3>>> h;
for (uint32_t a = 0; a < 10000; a++) {
    h.emplace_back();
    for (uint32_t b = 0; b < 10000; b++) {
        h.at(a).emplace_back();
        for (auto &c : h.at(a).at(b)) {
            c = 0xff;
        }
    }
}
Another alternative is to put the smaller dimension first.
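For illustration, a sketch of that alternative (this exact declaration is my own, not part of the answer): moving the 3-element dimension to the outside means only a few tens of thousands of vector objects exist instead of roughly a hundred million, so the per-vector overhead becomes negligible:
// 3 x 10000 x 10000 instead of 10000 x 10000 x 3.
std::vector<std::vector<std::vector<unsigned char>>> h(
    3,
    std::vector<std::vector<unsigned char>>(
        10000,
        std::vector<unsigned char>(10000, 0xff)));
// Pixel (x, y), channel c is then accessed as h[c][y][x].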
My guess would be that the memory is being heavily fragmented by the constant vector reallocation, resulting in madness. For data this large, I would suggest simply storing a 1-dimensional pre-allocated vector:
std::vector<unsigned char> h(10000 * 10000 * 3);
And then come up with an array-accessing scheme that takes the X/Y arguments and turns them into an index into your 1D array, e.g.:
int get_index(int x, int y, int width) {
    return ((y * width) + x) * 3;
}
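A quick usage sketch (x and y are assumed to be valid pixel coordinates; the 0xff fill is just an example):
// Write 0xff into all three channels of pixel (x, y).
int idx = get_index(x, y, 10000);
h[idx + 0] = 0xff;
h[idx + 1] = 0xff;
h[idx + 2] = 0xff;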
If the image size is always fixed, you can also use std::array (see multi-dimensional arrays), since the size is defined at compile-time and it won't suffer the same memory issues as the dynamically allocated vectors.
I don't know if this will help your problem, but you could try allocating the memory for the vec of vecs of vecs all at the beginning, with the constructor.
std::vector<std::vector<std::vector<unsigned char>>> h(10000, std::vector<std::vector<unsigned char>>(10000, std::vector<unsigned char>(3,0xff)));
BTW, you're getting a good start writing C++ at 15! I didn't start studying computer science till I was in my 20s. It really is a very marketable career path, and there are a lot of intellectually stimulating, challenging things to learn. Best of luck!

Why is bool [][] more efficient than vector<vector<bool>>

I am trying to solve a DP problem where I create a 2D array and fill it all the way. My function is called multiple times with different test cases. When I use a vector<vector<bool>>, I get a time-limit-exceeded error (it takes more than 2 seconds for all the test cases). However, when I use bool[][], it takes much less time (about 0.33 seconds), and I get a pass.
Can someone please help me understand why vector<vector<bool>> would be any less efficient than bool[][]?
bool findSubsetSum(const vector<uint32_t> &input)
{
    uint32_t sum = 0;
    for (uint32_t i : input)
        sum += i;
    if ((sum % 2) == 1)
        return false;
    sum /= 2;
#if 1
    bool subsum[input.size()+1][sum+1];
    uint32_t m = input.size()+1;
    uint32_t n = sum+1;
    for (uint32_t i = 0; i < m; ++i)
        subsum[i][0] = true;
    for (uint32_t j = 1; j < n; ++j)
        subsum[0][j] = false;
    for (uint32_t i = 1; i < m; ++i) {
        for (uint32_t j = 1; j < n; ++j) {
            if (j < input[i-1])
                subsum[i][j] = subsum[i-1][j];
            else
                subsum[i][j] = subsum[i-1][j] || subsum[i-1][j - input[i-1]];
        }
    }
    return subsum[m-1][n-1];
#else
    vector<vector<bool>> subsum(input.size()+1, vector<bool>(sum+1));
    for (uint32_t i = 0; i < subsum.size(); ++i)
        subsum[i][0] = true;
    for (uint32_t j = 1; j < subsum[0].size(); ++j)
        subsum[0][j] = false;
    for (uint32_t i = 1; i < subsum.size(); ++i) {
        for (uint32_t j = 1; j < subsum[0].size(); ++j) {
            if (j < input[i-1])
                subsum[i][j] = subsum[i-1][j];
            else
                subsum[i][j] = subsum[i-1][j] || subsum[i-1][j - input[i-1]];
        }
    }
    return subsum.back().back();
#endif
}
Thank you,
Ahmed.
If you need a matrix and you need to do high-performance work, a nested std::vector or std::array is not always the best solution, because these are not guaranteed to be contiguous in memory. Non-contiguous memory access results in more cache misses.
See more :
std::vector and contiguous memory of multidimensional arrays
Is the data in nested std::arrays guaranteed to be contiguous?
On the other hand, bool twoDAr[M][N] is guaranteed to be contiguous. That ensures fewer cache misses.
See more :
C / C++ MultiDimensional Array Internals
And to learn about cache-friendly code:
What is “cache-friendly” code?
Can someone please help me understand why vector<vector<bool>> would be any less
efficient than bool[][]?
A two-dimensional bool array is really just a big one-dimensional bool array of size M * N, with no gaps between the items.
A two-dimensional std::vector doesn't really exist; it is not one big one-dimensional std::vector but a std::vector of std::vectors. The outer vector itself has no memory gaps, but there may well be gaps between the content areas of the individual inner vectors. It depends on how your compiler implements the very special std::vector<bool> class, but if your element count is sufficiently big, then dynamic allocation is unavoidable to prevent a stack overflow, and that alone implies pointers to separate memory areas.
And once you need to access data from separated memory areas, things become slower.
Here is a possible solution:
Try to use a std::vector<bool> of size (input.size() + 1) * (sum + 1).
If that fails to make things faster, avoid the template specialisation by using a std::vector<char> of size (input.size() + 1) * (sum + 1), and cast the elements to and from bool as required.
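As a rough sketch of that second suggestion applied to the function above (the name findSubsetSumFlat and the cell() helper are my own illustration, not part of the answer):
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: the same DP as findSubsetSum above, but with one flat row-major
// table. Using char instead of bool sidesteps the std::vector<bool>
// specialisation mentioned in the answer.
bool findSubsetSumFlat(const std::vector<uint32_t> &input)
{
    uint32_t sum = 0;
    for (uint32_t i : input)
        sum += i;
    if ((sum % 2) == 1)
        return false;
    sum /= 2;

    const uint32_t m = input.size() + 1;
    const uint32_t n = sum + 1;
    std::vector<char> subsum(std::size_t(m) * n, 0);   // one contiguous allocation
    auto cell = [&](uint32_t i, uint32_t j) -> char & {
        return subsum[std::size_t(i) * n + j];          // row-major indexing
    };

    for (uint32_t i = 0; i < m; ++i)
        cell(i, 0) = 1;                                 // empty subset reaches sum 0
    for (uint32_t i = 1; i < m; ++i)
        for (uint32_t j = 1; j < n; ++j)
            cell(i, j) = (j < input[i - 1])
                             ? cell(i - 1, j)
                             : (cell(i - 1, j) || cell(i - 1, j - input[i - 1]));
    return cell(m - 1, n - 1) != 0;
}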
In cases where you know the size from the beginning, using an array will always be at least as fast as a vector, because a vector is a wrapper around an array, i.e. a higher-level implementation. Its benefit is allocating extra space for you when needed, which you don't need if you have a fixed number of elements.
If you had a problem that needed 1D arrays, the difference might not have bothered you (there you have a single vector versus a single array). But when you create a 2D structure, you also create many instances of the vector class, so the time difference between array and vector is multiplied by the number of inner vectors in your container, making your code slow.
This time difference has many causes behind it, but the most obvious one is of course calling the vector constructor. You are calling that constructor subsum.size() times. The memory issue mentioned by the other answers is another cause.
For performance, it is advisable to use arrays whenever you can. Even if you need a vector, you should try to minimize the number of resizes it performs (by reserving or pre-allocating), getting closer to the behaviour of an array.

Filling only half of a matrix using OpenMP in C++

I have a fairly big matrix. I would like to fill half of the matrix in parallel.
m_matrix is a 2D std::vector. Any suggestion for the type of container is appreciated as well. What the _fill(i, j) function does is not considered heavy compared to the size of the matrix.
// i: row
// j: column
for (size_t i = 1; i < num_row; ++i)
{
    for (size_t j = 0; j < i; ++j)
    {
        m_matrix[i][j] = _fill(i, j);
    }
}
What would be a nice OpenMP structure for that? I tried a dynamic scheduling strategy, but I even saw the time increase compared to the sequential version.
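One structure that matches this description is a parallel outer loop with dynamic scheduling, since row i has i elements and the iterations are independent of each other. The following is only a sketch, assuming _fill(i, j) is thread-safe and m_matrix is fully allocated beforehand:
// Parallelize over rows; schedule(dynamic) balances the uneven row lengths.
// A signed loop variable keeps this valid even for older OpenMP versions.
#pragma omp parallel for schedule(dynamic)
for (long long i = 1; i < (long long)num_row; ++i)
{
    for (long long j = 0; j < i; ++j)
    {
        m_matrix[i][j] = _fill(i, j);
    }
}
If _fill is very cheap, the threading overhead (and the per-iteration cost of schedule(dynamic)) can outweigh the gain, which would explain the observed slowdown; a chunked schedule such as schedule(dynamic, 16) or schedule(static, 1) reduces that overhead while still balancing the triangular workload.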

writing slower than the operation itself?

I am struggling to understand the behavior of my functions.
My code is written in C++ in Visual Studio 2012, running on Windows 7 64-bit. I am working with 2D arrays of floats. When I time my function, I see that its running time drops by 10x or more if I just stop writing my results to the output pointer. Does that mean that writing is slow?
Here is an example:
void TestSpeed(float** pInput, float** pOutput)
{
    UINT32 y, x, i, j;
    for (y = 3; y < 100-3; y++)
    {
        for (x = 3; x < 100-3; x++)
        {
            float fSum = 0;
            for (i = y-3; i <= y+3; i++)
            {
                for (j = x-3; j <= x+3; j++)
                {
                    fSum += pInput[y][x]*exp(-(pInput[y][x]-pInput[i][j])*(pInput[y][x]-pInput[i][j]));
                }
            }
            pOutput[y][x] = fSum;
        }
    }
}
If I comment out the line "pOutput[y][x] = fSum;" then the functions runs very quick. Why is that?
I am calling 2-3 such functions sequentially. Would it help to use the stack instead of the heap to write a chunk of results, pass it on to the next function, and then write it back to the heap buffer once that chunk is ready?
In some cases I saw that replacing pOutput[y][x] with a line buffer allocated on the stack, like float fResult[100], and using it to store the results works faster for larger data sizes.
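For clarity, here is a sketch of the kind of line-buffer change described above (it uses uint32_t instead of UINT32 so it stands alone, and the memcpy that flushes each row is my interpretation of the description):
#include <cmath>
#include <cstdint>
#include <cstring>

void TestSpeedLineBuffer(float** pInput, float** pOutput)
{
    float fResult[100];                       // one row of results on the stack
    for (uint32_t y = 3; y < 100-3; y++)
    {
        for (uint32_t x = 3; x < 100-3; x++)
        {
            float fSum = 0;
            for (uint32_t i = y-3; i <= y+3; i++)
                for (uint32_t j = x-3; j <= x+3; j++)
                    fSum += pInput[y][x]*std::exp(-(pInput[y][x]-pInput[i][j])*(pInput[y][x]-pInput[i][j]));
            fResult[x] = fSum;                // write to the local buffer
        }
        std::memcpy(&pOutput[y][3], &fResult[3], (100-6)*sizeof(float)); // flush the row once
    }
}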
Your code performs a lot of operations, and that takes time. Depending on what you are doing with the output, you may consider diagonalization or decomposition of your input matrix. Or you can look for values in your output that are n times another value, etc., and skip calculating the exponential for those.

C++: Improving cache performance in a 3d array

I don't know how to optimize cache performance at a really low level, thinking about cache-line size or associativity. That's not something you can learn overnight. Considering my program will run on many different systems and architectures, I don't think it would be worth it anyway. But still, there are probably some steps I can take to reduce cache misses in general.
Here is a description of my problem:
I have a 3d array of integers, representing values at points in space, like [x][y][z]. Each dimension is the same size, so it's like a cube. From that I need to make another 3d array, where each value in this new array is a function of 7 parameters: the corresponding value in the original 3d array, plus the 6 indices that "touch" it in space. I'm not worried about the edges and corners of the cube for now.
Here is what I mean in C++ code:
void process3DArray (int input[LENGTH][LENGTH][LENGTH],
                     int output[LENGTH][LENGTH][LENGTH])
{
    for (int i = 1; i < LENGTH-1; i++)
        for (int j = 1; j < LENGTH-1; j++)
            for (int k = 1; k < LENGTH-1; k++)
            //The for loops start at 1 and stop before LENGTH-1,
            //or otherwise I'll get out-of-bounds errors.
            //I'm not concerned with the edges and corners of the
            //3d array "cube" at the moment.
            {
                int value = input[i][j][k];
                //I am expecting crazy cache misses here:
                int posX = input[i+1][j][k];
                int negX = input[i-1][j][k];
                int posY = input[i][j+1][k];
                int negY = input[i][j-1][k];
                int posZ = input[i][j][k+1];
                int negZ = input[i][j][k-1];
                output[i][j][k] =
                    process(value, posX, negX, posY, negY, posZ, negZ);
            }
}
However, it seems like if LENGTH is large enough, I'll get tons of cache misses when I'm fetching the parameters for process. Is there a cache-friendlier way to do this, or a better way to represent my data other than a 3d array?
And if you have the time to answer these extra questions, do I have to consider the value of LENGTH? Like it's different whether LENGTH is 20 vs 100 vs 10000. Also, would I have to do something else if I used something other than integers, like maybe a 64-byte struct?
@ildjarn:
Sorry, I did not think that the code that generates the arrays I am passing into process3DArray mattered. But if it does, I would like to know why.
int main() {
    int data[LENGTH][LENGTH][LENGTH];
    for (int i = 0; i < LENGTH; i++)
        for (int j = 0; j < LENGTH; j++)
            for (int k = 0; k < LENGTH; k++)
                data[i][j][k] = rand() * (i + j + k);

    int result[LENGTH][LENGTH][LENGTH];
    process3DArray(data, result);
}
There's an answer to a similar question here: https://stackoverflow.com/a/7735362/6210 (by me!)
The main goal of optimizing a multi-dimensional array traversal is to make sure you visit the array such that you tend to reuse the cache lines accessed from the previous iteration step. For visiting each element of an array once and only once, you can do this just by visiting in memory order (as you are doing in your loop).
Since you are doing something more complicated than a simple element traversal (visiting an element plus 6 neighbors), you need to break up your traversal such that you don't access too many cache lines at once. Since the cache thrashing is dominated by traversing along j and k, you just need to modify the traversal such that you visit blocks at a time rather than rows at a time.
E.g.:
const int CACHE_LINE_STEP = 8;

void process3DArray (int input[LENGTH][LENGTH][LENGTH],
                     int output[LENGTH][LENGTH][LENGTH])
{
    for (int i = 1; i < LENGTH-1; i++)
        for (int k_start = 1, k_next = CACHE_LINE_STEP; k_start < LENGTH-1; k_start = k_next, k_next += CACHE_LINE_STEP)
        {
            int k_end = std::min(k_next, LENGTH - 1);
            for (int j = 1; j < LENGTH-1; j++)
            //The for loops start at 1 and stop before LENGTH-1,
            //or otherwise I'll get out-of-bounds errors.
            //I'm not concerned with the edges and corners of the
            //3d array "cube" at the moment.
            {
                for (int k = k_start; k < k_end; ++k)
                {
                    int value = input[i][j][k];
                    //I am expecting crazy cache misses here:
                    int posX = input[i+1][j][k];
                    int negX = input[i-1][j][k];
                    int posY = input[i][j+1][k];
                    int negY = input[i][j-1][k];
                    int posZ = input[i][j][k+1];
                    int negZ = input[i][j][k-1];
                    output[i][j][k] =
                        process(value, posX, negX, posY, negY, posZ, negZ);
                }
            }
        }
}
What this does is ensure that you don't thrash the cache, by visiting the grid in a block-oriented fashion (actually, more like a fat-column-oriented fashion bounded by the cache-line size). It's not perfect, as there are overlaps that cross cache lines between columns, but you can tweak it to make it better.
The most important thing you already have right. If you were using Fortran, you'd be doing it exactly wrong, but that's another story. What you have right is that you are processing in the inner loop along the direction where memory addresses are closest together. A single memory fetch (beyond the cache) will pull in multiple values, corresponding to a series of adjacent values of k. Inside your loop the cache will contain some number of values from i,j; a similar number from i+/-1, j and from i,j+/-1. So you basically have five disjoint sections of memory active. For small values of LENGTH these will only be 1 or three sections of memory. It is in the nature of how caches are built that you can have more than this many disjoint sections of memory in your active set.
I hope process() is small, and inline. Otherwise this may well be insignificant. Also, it will affect whether your code fits in the instruction cache.
Since you're interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ), and then take *(p++) inside the loop.
input[i+1] [j] [k];
is asking the compiler to generate 3 adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise one.
*inputIplusOneJK++
is asking for one add and a memory reference.
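A minimal sketch of that pointer-based inner loop (the pointer names are my own; it assumes the same array layout and process() function as process3DArray above):
// Walk the k direction with pointers instead of recomputing indices.
// Five input pointers cover value/posZ/negZ (same row), posX, negX, posY, negY.
for (int i = 1; i < LENGTH-1; i++)
    for (int j = 1; j < LENGTH-1; j++)
    {
        const int *pCenter = &input[i][j][1];   // value; k+1/k-1 are pCenter[1]/pCenter[-1]
        const int *pPosX   = &input[i+1][j][1];
        const int *pNegX   = &input[i-1][j][1];
        const int *pPosY   = &input[i][j+1][1];
        const int *pNegY   = &input[i][j-1][1];
        int *pOut          = &output[i][j][1];
        for (int k = 1; k < LENGTH-1; ++k)
        {
            *pOut++ = process(pCenter[0], *pPosX++, *pNegX++,
                              *pPosY++, *pNegY++, pCenter[1], pCenter[-1]);
            ++pCenter;
        }
    }
Each argument is read through a pointer that simply advances by one int per iteration, so the index arithmetic disappears from the inner loop.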