I am trying to solve a DP problem where I create a 2D array and fill it completely. My function is called multiple times with different test cases. When I use a vector<vector<bool>>, I get a time limit exceeded error (more than 2 seconds for all the test cases). However, when I use a plain bool [][], it takes much less time (about 0.33 seconds) and I get a pass.
Can someone please help me understand why vector<vector<bool>> would be any less efficient than bool [][].
bool findSubsetSum(const vector<uint32_t> &input)
{
    uint32_t sum = 0;
    for (uint32_t i : input)
        sum += i;
    if ((sum % 2) == 1)
        return false;
    sum /= 2;
#if 1
    // Note: a variable-length array is a compiler extension (GCC/Clang), not standard C++.
    bool subsum[input.size()+1][sum+1];
    uint32_t m = input.size()+1;
    uint32_t n = sum+1;
    for (uint32_t i = 0; i < m; ++i)
        subsum[i][0] = true;
    for (uint32_t j = 1; j < n; ++j)
        subsum[0][j] = false;
    for (uint32_t i = 1; i < m; ++i) {
        for (uint32_t j = 1; j < n; ++j) {
            if (j < input[i-1])
                subsum[i][j] = subsum[i-1][j];
            else
                subsum[i][j] = subsum[i-1][j] || subsum[i-1][j - input[i-1]];
        }
    }
    return subsum[m-1][n-1];
#else
    vector<vector<bool>> subsum(input.size()+1, vector<bool>(sum+1));
    for (uint32_t i = 0; i < subsum.size(); ++i)
        subsum[i][0] = true;
    for (uint32_t j = 1; j < subsum[0].size(); ++j)
        subsum[0][j] = false;
    for (uint32_t i = 1; i < subsum.size(); ++i) {
        for (uint32_t j = 1; j < subsum[0].size(); ++j) {
            if (j < input[i-1])
                subsum[i][j] = subsum[i-1][j];
            else
                subsum[i][j] = subsum[i-1][j] || subsum[i-1][j - input[i-1]];
        }
    }
    return subsum.back().back();
#endif
}
Thank you,
Ahmed.
If you need a matrix and you need to do high-performance work, a nested std::vector (or nested std::array) is not always the best solution, because nested containers are not guaranteed to be contiguous in memory, and non-contiguous memory access results in more cache misses.
See more:
std::vector and contiguous memory of multidimensional arrays
Is the data in nested std::arrays guaranteed to be contiguous?
On the other hand, bool twoDAr[M][N] is guaranteed to be contiguous, which means fewer cache misses.
See more:
C / C++ MultiDimensional Array Internals
And to know about cache friendly codes:
What is “cache-friendly” code?
"Can someone please help me understand why vector<vector<bool>> would be any less efficient than bool [][]."
A two-dimensional bool array is really just a big one-dimensional bool array of size M * N, with no gaps between the items.
There is no such thing as a two-dimensional std::vector: it is not one big one-dimensional std::vector but a std::vector of std::vectors. The outer vector itself has no memory gaps, but there may well be gaps between the content areas of the individual inner vectors. It depends on how your compiler implements the very special std::vector<bool> class, but if your element count is sufficiently big, dynamic allocation is unavoidable to prevent a stack overflow, and that alone implies pointers to separate memory areas.
And once you need to access data from separated memory areas, things become slower.
Here is a possible solution:
Try to use a std::vector<bool> of size (input.size() + 1) * (sum + 1).
If that fails to make things faster, avoid the template specialisation by using a std::vector<char> of size (input.size() + 1) * (sum + 1), and cast the elements to and from bool as required.
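For illustration, here is a minimal sketch of that second suggestion applied to the function above. The flat table and the little at helper are just for this sketch, not part of any library:
#include <cstdint>
#include <vector>
using std::uint32_t;
using std::vector;

bool findSubsetSum(const vector<uint32_t> &input)
{
    uint32_t sum = 0;
    for (uint32_t i : input)
        sum += i;
    if ((sum % 2) == 1)
        return false;
    sum /= 2;

    const uint32_t m = input.size() + 1;
    const uint32_t n = sum + 1;

    // One contiguous allocation; char avoids the vector<bool> specialisation.
    vector<char> subsum(static_cast<std::size_t>(m) * n, 0);
    auto at = [&](uint32_t i, uint32_t j) -> char& {
        return subsum[static_cast<std::size_t>(i) * n + j];
    };

    for (uint32_t i = 0; i < m; ++i)
        at(i, 0) = 1;                       // the empty subset sums to 0
    for (uint32_t i = 1; i < m; ++i)
        for (uint32_t j = 1; j < n; ++j)
            at(i, j) = (j < input[i-1])
                ? at(i-1, j)
                : (at(i-1, j) || at(i-1, j - input[i-1]));
    return at(m-1, n-1) != 0;
}
Row 0 (apart from column 0) is already false because the vector is zero-initialized, so the explicit false-filling loop is no longer needed.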
In cases where you know the number of elements from the beginning, using an array will always be at least as fast as a vector, because a vector is a wrapper around an array, i.e. a higher-level implementation. Its benefit is allocating extra space for you when needed, which you don't need if you have a fixed number of elements.
If you had a problem that needed 1D storage, the difference might not have bothered you (you would have a single vector versus a single array). But when you create a 2D structure, you also create many instances of the vector class, so the time difference between array and vector is multiplied by the number of elements in your container, making your code slow.
This time difference has many causes behind it, but the most obvious one is calling the vector constructor: you are calling that function subsum.size() times. The memory layout mentioned in the other answers is another cause.
For performance, it is advisable to use arrays whenever you can. Even if you need to use a vector, you should try to minimize the number of resizes it does (by reserving or pre-allocating), getting closer to the behaviour of a plain array.
Related
I'm working with a 4D matrix (using STL vectors). Usually the dimensions are different. For example, I'm reading a matrix whose dimensions are 192x256x128x96, and the following code pads it with 0s up to the largest dimension (256 in this case).
while (matriz.size() < width) // width is the size of N
{
    vector<vector<vector<short>>> aux;
    matriz.push_back(aux);
}
for (auto i = 0; i < matriz.size(); i++)
{
    while (matriz[i].size() < width)
    {
        vector<vector<short>> aux;
        matriz[i].push_back(aux);
    }
}
for (auto i = 0; i < matriz.size(); i++)
    for (auto j = 0; j < matriz[i].size(); j++)
        while (matriz[i][j].size() < width)
        {
            vector<short> aux;
            matriz[i][j].push_back(aux);
        }
for (auto i = 0; i < matriz.size(); i++)
    for (auto j = 0; j < matriz[i].size(); j++)
        for (auto k = 0; k < matriz[i][j].size(); k++)
            while (matriz[i][j][k].size() < width)
            {
                matriz[i][j][k].push_back(0);
            }
That code works with small to medium sized 4D matrices. I've tried it with 200x200x200x200 and it really works, but I need to use it with a 256x256x256x256 matrix, and when I run it my computer doesn't respond.
I'm not sure if it is a RAM issue. My computer has 12GB of RAM, and if I'm not mistaken, the size of the matrix is 8GB.
Any idea how to fix this?
Edit:
If I let the program run, it gets killed a while later.
The memory usage with a 200x200x200x200 matrix is 56.7%.
Let's see if I have this right. You are producing:
1 vector that holds
  256 vectors that each hold
    256 vectors (65,536 in total) that each hold
      256 vectors (16,777,216 in total) that each hold
        256 shorts (4,294,967,296 shorts in total, or 8,589,934,592 bytes, as you indicated)
I don't know the size of each vector object itself, but it's probably well under 1k, so you're using less than 10 gig of memory.
However, that's a LOT going on. Is it really hanging, or is it just taking a very, very long time?
Some periodic debug output would help answer that.
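For example, a heartbeat print in the outermost loop would show whether progress is being made (this assumes <iostream> is included; one line per outer slice is an arbitrary choice):
for (auto i = 0; i < matriz.size(); i++)
{
    std::cerr << "slice " << i << " of " << width << "\n"; // progress heartbeat
    // ... existing inner loops ...
}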
Some tips (from the comments):
Run an optimized build (-O3), this should speed up processing.
Instead of push_back() of an empty vector in a loop, use resize(). This will prevent costly reallocation.
So for example, replace
while (matriz.size() < width) // width is the size of N
{
    vector<vector<vector<short>>> aux;
    matriz.push_back(aux);
}
With
matriz.resize(width);
If you do still need to use push_back() in a loop, at least reserve() the capacity beforehand. This can again prevent costly reallocations. Reallocating a vector can briefly double the amount of memory that it would normally use.
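For instance, a minimal sketch of that, applied to the innermost level of the code above (the same idea works at every level):
matriz[i][j][k].reserve(width);            // one allocation up front
while (matriz[i][j][k].size() < width)
    matriz[i][j][k].push_back(0);          // no reallocations now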
Use tools like top to watch memory and swap usage on the machine in real time. If you notice swap space used increasing, that means the machine is running out of memory.
I'm making a picture editing program, and I'm stuck in allocating memory.
I have no idea what is going on.
Ok.. So when I do this:
std::vector<unsigned char> h;
for (int a = 0; a < 10000 * 10000 * 3; a++) {
    h.push_back(0);
}
this is fine (sorry, I had to), but when I do this:
std::vector<std::vector<std::vector<unsigned char>>> h;
for (uint32_t a = 0; a < 10000; a++) {
    h.push_back({});
    for (uint32_t b = 0; b < 10000; b++) {
        h.at(a).push_back({});
        for (uint32_t c = 0; c < 3; c++) {
            h.at(a).at(b).push_back(0xff);
        }
    }
}
my memory usage explodes, and I get error: Microsoft C++ exception: std::bad_alloc at memory location 0x009CF51C
I'm working with .bmp.
Currently, code is in testing mode so it's basically a giant mess...
I'm 15, so don't expect much of me.
I was searching for solutions, but all I found was like how to handle large integers and so on...
If you can give me maybe another solution, but I want my code to be as beginner friendly as it can get.
This is due to the overhead of vector<unsigned char>. Each such object with 3 elements takes not 3 bytes but probably 4 in its heap block (due to the reallocation policy), plus 3 pointers, which probably take 3*8 = 24 bytes. Overall your structure takes about 9.3 times the memory it could.
If you replace the inner vector with an array, it will start working, since an array does not have this overhead:
std::vector<std::vector<std::array<unsigned char, 3>>> h;
for (uint32_t a = 0; a < 10000; a++) {
    h.emplace_back();
    for (uint32_t b = 0; b < 10000; b++) {
        h.at(a).emplace_back();
        for (auto &c : h.at(a).at(b)) {
            c = 0xff;
        }
    }
}
Another alternative is to put the smaller dimension first: with the 3-element dimension outermost, you create roughly thirty thousand vector objects instead of roughly a hundred million.
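A minimal sketch of that reordering (an element is then accessed as h[channel][x][y] instead of h[x][y][channel]):
#include <vector>

int main() {
    // 3 x 10000 x 10000 instead of 10000 x 10000 x 3:
    // only ~30,000 inner vectors are created instead of ~100,000,000.
    std::vector<std::vector<std::vector<unsigned char>>> h(
        3, std::vector<std::vector<unsigned char>>(
               10000, std::vector<unsigned char>(10000, 0xff)));
}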
My guess would be that the memory is being heavily fragmented by the constant vector reallocation, resulting in madness. For data this large, I would suggest simply storing a single pre-allocated 1-dimensional vector:
std::vector<unsigned char> h(10000 * 10000 * 3);
And then come up with an indexing scheme that takes the X/Y arguments and turns them into an index into your 1D array, e.g.:
int get_index(int x, int y, int width) {
    return ((y * width) + x) * 3;
}
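For example, to set the pixel at (x, y) to white (the variable names here are just for illustration):
int idx = get_index(x, y, 10000);
h[idx + 0] = 0xff; // red channel
h[idx + 1] = 0xff; // green channel
h[idx + 2] = 0xff; // blue channel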
If the image size is always fixed, you can also use std::array (see multi-dimensional arrays), since the size is defined at compile time and it won't suffer the same memory issues as the dynamically allocated vectors (though an array this large would itself need to live on the heap to avoid a stack overflow).
I don't know if this will help your problem, but you could try allocating the memory for the vec of vecs of vecs all at the beginning, with the constructor.
std::vector<std::vector<std::vector<unsigned char>>> h(
    10000, std::vector<std::vector<unsigned char>>(
               10000, std::vector<unsigned char>(3, 0xff)));
BTW, you're getting a good start writing C++ at 15! I didn't start studying computer science till I was in my 20s. It really is a very marketable career path, and there are a lot of intellectually stimulating, challenging things to learn. Best of luck!
According to Visual Studio's performance analyzer, the following function is consuming what seems to me to be an abnormally large amount of processor power, seeing as all it does is add between 1 and 3 numbers from several vectors and store the result in one of those vectors.
//Relevant class members:
//vector<double> cache (~80,000);
//int inputSize;
//Notes:
//RealFFT::real is a typedef for POD double.
//RealFFT::RealSet is a wrapper class for a c-style array of RealFFT::real.
//This is because of the FFT library I'm using (FFTW).
//Its bracket operator is overloaded to return a const reference to the appropriate array element.
vector<RealFFT::real> Convolver::store(vector<RealFFT::RealSet>& data)
{
    int cr = inputSize; //'cache' read position
    int cw = 0;         //'cache' write position
    int di = 0;         //index within 'data' vector (ex. data[di])
    int bi = 0;         //index within 'data' element (ex. data[di][bi])
    int blockSize = irBlockSize();
    int dataSize = data.size();
    int cacheSize = cache.size();

    //Basically, this takes the existing values in 'cache', sums them with the
    //values in 'data' at the appropriate positions, and stores them back in
    //the cache at a new position.
    while (cw < cacheSize)
    {
        int n = 0;
        if (di < dataSize)
            n = data[di][bi];
        if (di > 0 && bi < inputSize)
            n += data[di - 1][blockSize + bi];
        if (++bi == blockSize)
        {
            di++;
            bi = 0;
        }
        if (cr < cacheSize)
            n += cache[cr++];
        cache[cw++] = n;
    }
    //Take the first 'inputSize' number of values and return them in a new vector.
    return Common::vecTake<RealFFT::real>(inputSize, cache, 0);
}
Granted, the vectors in question have sizes of around 80,000 items, but by comparison, a function that multiplies similar vectors of complex numbers (complex multiplication requires 4 real multiplications and 2 additions each) consumes about a third of the processor power.
Perhaps it has something to do with the fact that it has to jump around within the vectors rather than just accessing them linearly? I really have no idea though. Any thoughts on how this could be optimized?
Edit: I should mention I also tried writing the function to access each vector linearly, but that requires more total iterations and the performance was actually worse.
Turn on compiler optimization as appropriate. A guide for MSVC is here:
http://msdn.microsoft.com/en-us/library/k1ack8f1.aspx
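For example, a command-line release build with optimizations enabled might look like this (the file name is hypothetical):
cl /O2 /EHsc convolver.cpp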
Eigen is a well-known matrix library in C++. I am having trouble finding a built-in function to simply push an item onto the end of a matrix. Currently I know that it can be done like this:
Eigen::MatrixXd matrix(10, 3);
long int count = 0;
long int topCount = 10;
for (int i = 0; i < listLength; ++i) {
    matrix(count, 0) = list[i].x;
    matrix(count, 1) = list[i].y;
    matrix(count, 2) = list[i].z;
    count++;
    if (count == topCount) {
        topCount *= 2;
        matrix.conservativeResize(topCount, 3);
    }
}
matrix.conservativeResize(count, 3);
And this will work (some of the syntax may be off), but it's pretty convoluted for such a simple thing. Is there already a built-in function?
There is no such function for Eigen matrices. The reason for this is such a function would either be very slow or use excessive memory.
For a push_back function to not be prohibitively expensive it must increase the matrix's capacity by some factor when it runs out of space as you have done. However when dealing with matrices, memory usage is often a concern so having a matrix's capacity be larger than necessary could be problematic.
If it instead increased the size by rows() or cols() each time, the operation would be O(n*m). Doing this to fill an entire matrix would be O(n*n*m*m), which for even moderately sized matrices would be quite slow.
Additionally, in linear algebra matrix and vector sizes are nearly always constant and known beforehand. Often, when resizing a matrix, you don't care about the previous values it held. This is why Eigen's resize function does not retain old values, unlike std::vector's resize.
The only case I can think of where you wouldn't know the matrix's size beforehand is when reading from a file. In this case I would either load the data first into a standard container such as std::vector using push_back and then copy it into an already sized matrix, or if memory is tight run through the file once to get the size and then a second time to copy the values.
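A minimal sketch of the first approach, with an assumed text format of three doubles per line (the function name and file format are illustrative only):
#include <Eigen/Dense>
#include <fstream>
#include <vector>

Eigen::MatrixXd loadMatrix(const char* path)
{
    std::vector<double> values;     // grows with amortized O(1) push_back
    std::ifstream in(path);
    double x, y, z;
    while (in >> x >> y >> z) {     // assumed format: three doubles per line
        values.push_back(x);
        values.push_back(y);
        values.push_back(z);
    }
    // Size the matrix exactly once, then copy the values across.
    Eigen::MatrixXd m(values.size() / 3, 3);
    for (Eigen::Index r = 0; r < m.rows(); ++r)
        m.row(r) = Eigen::Vector3d(values[3*r], values[3*r + 1], values[3*r + 2]);
    return m;
}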
There is no such function; however, you can build something like this yourself:
#include <Eigen/Dense>
#include <cstddef>
#include <iostream>

using Eigen::MatrixXd;
using Eigen::Vector3d;

template <typename DynamicEigenMatrix>
void push_back(DynamicEigenMatrix& m, Vector3d&& values, std::size_t row)
{
    if (row >= static_cast<std::size_t>(m.rows())) {
        m.conservativeResize(row + 1, Eigen::NoChange);
    }
    m.row(row) = values;
}

int main()
{
    MatrixXd matrix(10, 3);
    for (std::size_t i = 0; i < 10; ++i) {
        push_back(matrix, Vector3d(1, 2, 3), i);
    }
    std::cout << matrix << "\n";
    return 0;
}
If this needs to perform too many resizes though, it's going to be horrendously slow.
I don't know how to optimize cache performance at a really low level, thinking about cache-line size or associativity. That's not something you can learn overnight. Considering my program will run on many different systems and architectures, I don't think it would be worth it anyway. But still, there are probably some steps I can take to reduce cache misses in general.
Here is a description of my problem:
I have a 3d array of integers, representing values at points in space, like [x][y][z]. Each dimension is the same size, so it's like a cube. From that I need to make another 3d array, where each value in this new array is a function of 7 parameters: the corresponding value in the original 3d array, plus the 6 indices that "touch" it in space. I'm not worried about the edges and corners of the cube for now.
Here is what I mean in C++ code:
void process3DArray (int input[LENGTH][LENGTH][LENGTH],
                     int output[LENGTH][LENGTH][LENGTH])
{
    for (int i = 1; i < LENGTH-1; i++)
        for (int j = 1; j < LENGTH-1; j++)
            for (int k = 1; k < LENGTH-1; k++)
            //The for loops start at 1 and stop before LENGTH-1,
            //or otherwise I'll get out-of-bounds errors.
            //I'm not concerned with the edges and corners of the
            //3d array "cube" at the moment.
            {
                int value = input[i][j][k];
                //I am expecting crazy cache misses here:
                int posX = input[i+1][j][k];
                int negX = input[i-1][j][k];
                int posY = input[i][j+1][k];
                int negY = input[i][j-1][k];
                int posZ = input[i][j][k+1];
                int negZ = input[i][j][k-1];
                output[i][j][k] =
                    process(value, posX, negX, posY, negY, posZ, negZ);
            }
}
However, it seems like if LENGTH is large enough, I'll get tons of cache misses when I'm fetching the parameters for process. Is there a cache-friendlier way to do this, or a better way to represent my data other than a 3d array?
And if you have the time to answer these extra questions, do I have to consider the value of LENGTH? Like it's different whether LENGTH is 20 vs 100 vs 10000. Also, would I have to do something else if I used something other than integers, like maybe a 64-byte struct?
@ildjarn:
Sorry, I did not think that the code that generates the arrays I am passing into process3DArray mattered. But if it does, I would like to know why.
int main() {
    int data[LENGTH][LENGTH][LENGTH];
    for (int i = 0; i < LENGTH; i++)
        for (int j = 0; j < LENGTH; j++)
            for (int k = 0; k < LENGTH; k++)
                data[i][j][k] = rand() * (i + j + k);

    int result[LENGTH][LENGTH][LENGTH];
    process3DArray(data, result);
}
There's an answer to a similar question here: https://stackoverflow.com/a/7735362/6210 (by me!)
The main goal of optimizing a multi-dimensional array traversal is to make sure you visit the array such that you tend to reuse the cache lines accessed from the previous iteration step. For visiting each element of an array once and only once, you can do this just by visiting in memory order (as you are doing in your loop).
Since you are doing something more complicated than a simple element traversal (visiting an element plus 6 neighbors), you need to break up your traversal such that you don't access too many cache lines at once. Since the cache thrashing is dominated by traversing along j and k, you just need to modify the traversal such that you visit blocks at a time rather than rows at a time.
E.g.:
const int CACHE_LINE_STEP = 8;

void process3DArray (int input[LENGTH][LENGTH][LENGTH],
                     int output[LENGTH][LENGTH][LENGTH])
{
    for (int i = 1; i < LENGTH-1; i++)
        for (int k_start = 1, k_next = CACHE_LINE_STEP;
             k_start < LENGTH-1;
             k_start = k_next, k_next += CACHE_LINE_STEP)
        {
            int k_end = min(k_next, LENGTH - 1);
            for (int j = 1; j < LENGTH-1; j++)
            {
                //The loops start at 1 and stop before LENGTH-1,
                //or otherwise I'll get out-of-bounds errors.
                //I'm not concerned with the edges and corners of the
                //3d array "cube" at the moment.
                for (int k = k_start; k < k_end; ++k)
                {
                    int value = input[i][j][k];
                    int posX = input[i+1][j][k];
                    int negX = input[i-1][j][k];
                    int posY = input[i][j+1][k];
                    int negY = input[i][j-1][k];
                    int posZ = input[i][j][k+1];
                    int negZ = input[i][j][k-1];
                    output[i][j][k] =
                        process(value, posX, negX, posY, negY, posZ, negZ);
                }
            }
        }
}
What this does is ensure that you don't thrash the cache: you visit the grid in a block-oriented fashion (actually, more like a fat-column-oriented fashion bounded by the cache line size). It's not perfect, as there are overlaps that cross cache lines between columns, but you can tweak it to make it better.
The most important thing you already have right. If you were using Fortran, you'd be doing it exactly wrong, but that's another story. What you have right is that you are processing in the inner loop along the direction where memory addresses are closest together: a single memory fetch (beyond the cache) will pull in multiple values, corresponding to a series of adjacent values of k.
Inside your loop the cache will contain some number of values from i,j; a similar number from i+/-1,j and from i,j+/-1. So you basically have five disjoint sections of memory active. For small values of LENGTH these will only be one or three sections of memory. It is in the nature of how caches are built that you can have more than this many disjoint sections of memory in your active set.
I hope process() is small and inline; otherwise this may well be insignificant. It will also affect whether your code fits in the instruction cache.
Since you're interested in performance, it is almost always better to initialize five pointers (you only need one for value, posZ and negZ), and then take *(p++) inside the loop.
input[i+1] [j] [k];
is asking the compiler to generate three adds and two multiplies, unless you have a very good optimizer. If your compiler is particularly lazy about register allocation, you also get four memory accesses; otherwise just one.
*inputIplusOneJK++
is asking for one add and a memory reference.
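For illustration, a sketch of what that pointer-based inner loop might look like (the pointer names are made up for this sketch; value, posZ and negZ all read through one pointer at offsets -1, 0 and +1):
for (int j = 1; j < LENGTH-1; j++)
{
    // One pointer per disjoint memory region; value, posZ and negZ
    // share pCenter via the offsets [-1], [0] and [+1].
    const int *pCenter = &input[i][j][1];
    const int *pPosX   = &input[i+1][j][1];
    const int *pNegX   = &input[i-1][j][1];
    const int *pPosY   = &input[i][j+1][1];
    const int *pNegY   = &input[i][j-1][1];
    int *pOut          = &output[i][j][1];

    for (int k = 1; k < LENGTH-1; k++)
    {
        *pOut++ = process(pCenter[0], *pPosX++, *pNegX++,
                          *pPosY++, *pNegY++, pCenter[1], pCenter[-1]);
        ++pCenter;
    }
}
Each pointer advances by one element per iteration, so every access becomes a dereference plus an increment instead of a fresh index computation.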