I would like to store multiple matmul results as row vectors in another matrix, but my current code seems to use a lot of memory. Here is my pseudo code:
for (int i = 0; i < C_row; ++i) {
    C.row(i) = (A.transpose() * B).reshaped(1, C_col);
}
In this case, C is actually a Map of a pre-allocated array, declared as Map<Matrix<float, -1, -1, RowMajor>> C(C_array, C_row, C_col);.
Therefore, I expect the calculated matmul results to go directly into the memory of C without creating temporary copies. In other words, the total memory usage should be the same with or without the above code. But I found that with the above code, the memory usage increases significantly.
I tried using C.row(i).noalias() to assign the results directly to each row of C, but there was no difference in memory usage. How can I make this code more efficient so that it takes less memory?
The reshaped is the culprit. It cannot be folded into the matrix multiplication, so it results in a temporary allocation for the product. Ideally you would need to put it on the left of the assignment:
C.row(i).reshaped(A.cols(), B.cols()).noalias() = A.transpose() * B;
However, that does not compile. Reshaped doesn't seem to fulfil the required interface. It's a pretty new addition to Eigen, so I'm not overly surprised. You might want to open a feature request on their bug tracker.
Anyway, as a workaround, try this:
Eigen::Map<Eigen::MatrixXf> reshaped(C.row(i).data(), A.cols(), B.cols());
reshaped.noalias() = A.transpose() * B;
(Note that the Map has to use float to match C's scalar type, and C.row(i).data() is only valid as one contiguous block because C is RowMajor.)
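Putting it together, here is a minimal self-contained sketch of the workaround; the dimensions and the setRandom() calls are placeholders for however A and B are actually produced for each row:

#include <Eigen/Dense>

int main()
{
    using Eigen::Dynamic;
    using Eigen::RowMajor;

    const int C_row = 4, A_rows = 2, A_cols = 3, B_cols = 5;
    const int C_col = A_cols * B_cols;

    float C_array[4 * 3 * 5];  // pre-allocated storage of size C_row * C_col
    Eigen::Map<Eigen::Matrix<float, Dynamic, Dynamic, RowMajor>> C(C_array, C_row, C_col);

    Eigen::MatrixXf A(A_rows, A_cols), B(A_rows, B_cols);
    for (int i = 0; i < C_row; ++i) {
        A.setRandom();  // placeholder: compute A and B for this row
        B.setRandom();

        // View the i-th (contiguous) row of C as an A_cols x B_cols matrix and
        // write the product directly into it; no temporary is allocated.
        Eigen::Map<Eigen::MatrixXf> reshaped(C.row(i).data(), A_cols, B_cols);
        reshaped.noalias() = A.transpose() * B;
    }
    return 0;
}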
I know that sparse matrix support in Armadillo is still preliminary.
I'm using the Armadillo library in my quantum systems research, and I have a problem constructing sparse matrices in a RAM-efficient way.
So far I have been using my own implementation of sparse matrices, but I want to switch to an optimized matrix class.
I'm filling elements in batch mode:
umat loc(2, size);   // each column holds the (row, column) location of one nonzero
cx_vec val(size);
// calculate loc and val
...
//
sp_cx_mat Hamiltonian(loc, val);
Constructing Hamiltonian this way copies the values from loc and val into the constructor, and for a few seconds it requires twice the RAM. I'm calculating huge matrices (size is about 2**L, where L = 22, 24, ...), so I want the code to be as memory-efficient as possible.
For comparison, for a matrix of size 705432x705432, here are the RAM usage and filling times:
my implementation (COO format): time 7.95s, memory 317668kB
armadillo (CSC format): time 5.32s, memory 715000kB
Is it possible to deallocate fragments of the vectors loc and val on the fly, element by element, to save memory?
The answer here will be to use the other sparse matrix constructor, the one that takes the CSC format directly, so you will need to modify your // calculate loc and val code to fill the following three arrays instead:
values (length equal to number of points)
row_indices (length equal to number of points)
col_ptrs (length equal to number of columns plus one)
The points should be arranged in column-major ordering in the values and row_indices vectors, and the col_ptrs vector contains the number of nonzero elements before the beginning of each column. That is, col_ptrs[0] will always contain 0, col_ptrs[1] will contain the number of nonzero elements in the first column, col_ptrs[2] will contain the number of nonzero elements in the first and second columns, and col_ptrs[n_cols] (the last entry) will contain the total number of nonzero elements in the matrix.
For more documentation on this constructor, see the "Batch constructors" section of http://arma.sourceforge.net/docs.html#SpMat ; this is the fourth entry in that list.
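For illustration, here is a minimal sketch of that constructor on a toy 3x3 complex matrix; the particular indices and values are placeholders:

#include <armadillo>
using namespace arma;

int main()
{
    // Toy 3x3 matrix with three nonzeros, already in CSC form:
    //   column 0 holds entries at rows 0 and 2, column 1 is empty,
    //   column 2 holds one entry at row 1.
    uvec   row_indices = {0, 2, 1};
    uvec   col_ptrs    = {0, 2, 2, 3};   // n_cols + 1 entries; last one = total nonzeros
    cx_vec values      = {cx_double(1, 0), cx_double(2, 1), cx_double(3, -1)};

    sp_cx_mat Hamiltonian(row_indices, col_ptrs, values, 3, 3);
    Hamiltonian.print("Hamiltonian:");
    return 0;
}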
If you cannot easily modify your calculation code to adhere to that format, then you might be better off trying to specify sort_locations = false to the constructor you are using, if you are not already doing that.
I have a growing database in form of an Eigen::MatrixXd. My matrix starts empty and gets rows added one by one until it reaches a maximum predefined (known at compile time) number of rows.
At the moment I grow it like this (following the Eigen docs and many posts here and elsewhere):
MatrixXd new_database(database.rows() + 1, database.cols());
new_database << database, new_row;   // copies the entire existing database
database = new_database;             // and copies it again on assignment
But this seems far more inefficient than it needs to be, since it performs a full reallocation and copies all the data every time a new row is added... It seems like I should be able to pre-allocate a chunk of memory of size MAX_ROWS*N_COLS and let the matrix grow into it; however, I can't find an equivalent of std::vector's capacity in Eigen.
Note: I may need to use the matrix at anytime before it is actually full. So I do need a distinction between what would be its size and what would be its capacity.
How can I do this?
EDIT 1: I see there is a MaxSizeAtCompileTime, but I find the doc rather unclear, with no examples. Does anyone know whether this is the way to go, how to use this parameter, and how it interacts with resize and conservativeResize?
EDIT 2: C++: Eigen conservativeResize too expensive? provides another interesting approach while raising questions regarding non-contiguous data... Does anyone have good insight on that matter?
First thing I want to mention before I forget: you may want to consider using a row-major matrix for storage, so that each row is contiguous in memory.
The simplest (and probably best) solution to your question would be to use block operations to access the top rows.
#include <Eigen/Core>
#include <iostream>

using namespace Eigen;

int main(void)
{
    const int rows = 5;
    const int cols = 6;

    MatrixXd database(rows, cols);   // pre-allocate the maximum size up front
    database.setConstant(-1.0);

    std::cout << database << "\n\n";

    for (int i = 0; i < rows; i++)
    {
        // RowVectorXd matches the 1 x cols shape of database.row(i)
        database.row(i) = RowVectorXd::Constant(cols, i);

        // Use block operations instead of the full matrix
        std::cout << database.topRows(i + 1) << "\n\n";
    }

    std::cout << database << "\n\n";

    return 0;
}
Instead of just printing the matrix, you could do whatever operations you require.
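To make the size/capacity distinction from the question explicit, here is a minimal sketch of a hand-rolled wrapper (my own construction, not an Eigen facility) that allocates MAX_ROWS once and tracks the logical row count separately:

#include <Eigen/Core>
#include <cassert>

// Minimal "growable" matrix: capacity is fixed at construction, size grows.
struct Database
{
    Eigen::MatrixXd storage;   // max_rows x cols, allocated exactly once
    Eigen::Index    n_rows = 0;

    Database(Eigen::Index max_rows, Eigen::Index cols) : storage(max_rows, cols) {}

    void append(const Eigen::RowVectorXd& row)
    {
        assert(n_rows < storage.rows() && "capacity exhausted");
        storage.row(n_rows++) = row;   // no reallocation, no copy of old rows
    }

    // View of the filled part; usable in Eigen expressions at any time.
    auto filled() { return storage.topRows(n_rows); }
};

int main()
{
    Database db(1000, 6);                 // capacity of 1000 rows
    db.append(Eigen::RowVectorXd::Ones(6));
    db.append(Eigen::RowVectorXd::Zero(6));

    Eigen::RowVectorXd col_means = db.filled().colwise().mean();
    return 0;
}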
I am writing some code that needs to be as fast as possible without sucking up all of my research time (in other words, no hand optimized assembly).
My systems primarily consist of a bunch of 3D points (atomic systems) and so the code I write does lots of distance comparisons, nearest-neighbor searches, and other types of sorting and comparisons. These are large, million or billion point systems, and the naive O(n^2) nested for loops just won't cut it.
It would be easiest for me to just use a std::vector to hold point coordinates. And at first I thought it would probably be about as fast as an array, so that's great! However, this question (Is std::vector so much slower than plain arrays?) has left me with a very uneasy feeling. I don't have time to write all of my code using both arrays and vectors and benchmark them, so I need to make a good decision right now.
I am sure that someone who knows the detailed implementation behind std::vector could use those functions with very little speed penalty. However, I primarily program in C, so I have no clue what std::vector is doing behind the scenes, whether push_back performs a new memory allocation every time I call it, or what other "traps" I could fall into that would make my code very slow.
An array is simple though; I know exactly when memory is being allocated, what the order of all my algorithms will be, etc. There are no black-box unknowns that I may have to suffer through. Yet I so often see people criticized for using arrays over vectors on the internet that I can't help but wonder if I am missing some information.
EDIT: To clarify, someone asked "Why would you be manipulating such large datasets with arrays or vectors"? Well, ultimately, everything is stored in memory, so you need to pick some bottom layer of abstraction. For instance, I use kd-trees to hold the 3D points, but even so, the kd-tree needs to be built off an array or vector.
Also, I'm not implying that compilers cannot optimize (I know the best compilers can outperform humans in many cases), but simply that they cannot optimize better than what their constraints allow, and I may be unintentionally introducing constraints simply due to my ignorance of the implementation of vectors.
It all depends on how you implement your algorithms. std::vector is such a general container concept that it gives us flexibility, but leaves us with the freedom and responsibility of structuring the implementation of an algorithm deliberately. Most of the efficiency overhead that we observe with std::vector comes from copying. std::vector provides a constructor which lets you initialize N elements with value X, and when you use that, the vector is just as fast as an array.
I ran the std::vector vs. array tests described here:
#include <cstdlib>
#include <vector>
#include <iostream>
#include <string>
#include <boost/date_time/posix_time/ptime.hpp>
#include <boost/date_time/microsec_time_clock.hpp>

// RAII timer: prints the elapsed time when it goes out of scope
class TestTimer
{
public:
    TestTimer(const std::string& name)
        : name(name),
          start(boost::date_time::microsec_clock<boost::posix_time::ptime>::local_time())
    {
    }

    ~TestTimer()
    {
        using namespace std;
        using namespace boost;

        posix_time::ptime now(date_time::microsec_clock<posix_time::ptime>::local_time());
        posix_time::time_duration d = now - start;

        cout << name << " completed in " << d.total_milliseconds() / 1000.0
             << " seconds" << endl;
    }

private:
    std::string name;
    boost::posix_time::ptime start;
};
struct Pixel
{
    Pixel()
    {
    }

    Pixel(unsigned char r, unsigned char g, unsigned char b)
        : r(r), g(g), b(b)
    {
    }

    unsigned char r, g, b;
};
void UseVector()
{
    TestTimer t("UseVector");

    for (int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.resize(dimension * dimension);

        for (int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }
    }
}

void UseVectorPushBack()
{
    TestTimer t("UseVectorPushBack");

    for (int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels;
        pixels.reserve(dimension * dimension);

        for (int i = 0; i < dimension * dimension; ++i)
            pixels.push_back(Pixel(255, 0, 0));
    }
}

void UseArray()
{
    TestTimer t("UseArray");

    for (int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        Pixel* pixels = (Pixel*)malloc(sizeof(Pixel) * dimension * dimension);

        for (int i = 0; i < dimension * dimension; ++i)
        {
            pixels[i].r = 255;
            pixels[i].g = 0;
            pixels[i].b = 0;
        }

        free(pixels);
    }
}

void UseVectorCtor()
{
    TestTimer t("UseConstructor");

    for (int i = 0; i < 1000; ++i)
    {
        int dimension = 999;

        std::vector<Pixel> pixels(dimension * dimension, Pixel(255, 0, 0));
    }
}

int main()
{
    TestTimer t1("The whole thing");

    UseArray();
    UseVector();
    UseVectorCtor();
    UseVectorPushBack();

    return 0;
}
and here are the results (compiled on Ubuntu amd64 with g++ -O3):
UseArray completed in 0.325 seconds
UseVector completed in 1.23 seconds
UseConstructor completed in 0.866 seconds
UseVectorPushBack completed in 8.987 seconds
The whole thing completed in 11.411 seconds
Clearly, push_back wasn't a good choice here; using the constructor is still more than twice as slow as the array.
Now, giving Pixel an empty copy constructor (note that it deliberately copies nothing, which is only useful for showing how much of the time is spent on copying):
Pixel(const Pixel&) {} // intentionally empty: r, g, b are not copied
gives us following results:
UseArray completed in 0.331 seconds
UseVector completed in 0.306 seconds
UseConstructor completed in 0 seconds
UseVectorPushBack completed in 2.714 seconds
The whole thing completed in 3.352 seconds
So in summary: re-think your algorithm, or otherwise perhaps resort to a custom wrapper around new[]/delete[]. In any case, the STL implementation isn't slower for some unknown reason; it just does exactly what you ask, hoping you know better.
If you have just started using vectors, it might be surprising how they behave. For example, this code:
#include <iostream>
#include <vector>

using std::cout;
using std::endl;

class U
{
    int i_;

public:
    U() {}   // note: deliberately leaves i_ uninitialized
    U(int i) : i_(i) { cout << "consting " << i_ << endl; }
    U(const U& ot) : i_(ot.i_) { cout << "copying " << i_ << endl; }
};

int main(int argc, char** argv)
{
    std::vector<U> arr(2, U(3));
    arr.resize(4);   // copies both the existing elements and the fill value

    return 0;
}
results in the following output (the garbage numbers are copies of a default-constructed U whose i_ was never initialized):
consting 3
copying 3
copying 3
copying 548789016
copying 548789016
copying 3
copying 3
Vectors guarantee that the underlying data is a contiguous block in memory. The only sane way to guarantee this is by implementing it as an array.
Memory reallocation on pushing new elements can happen, because the vector can't know in advance how many elements you are going to add to it. But when you do know in advance, you can call reserve with the appropriate number of entries to avoid reallocation when adding them.
Vectors are usually preferred over arrays because they allow bounds checking when accessing elements with .at(). That means accessing indices outside of the vector doesn't cause undefined behavior as it does with an array. This bounds checking does, however, require additional CPU cycles. When you use the []-operator to access elements, no bounds checking is done and access should be as fast as an array. This, however, risks undefined behavior when your code is buggy.
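A short sketch of both points, reserve to avoid reallocation and .at() versus operator[] (the sizes here are arbitrary):

#include <cassert>
#include <iostream>
#include <stdexcept>
#include <vector>

int main()
{
    std::vector<double> v;
    v.reserve(1000);                 // one allocation up front

    for (int i = 0; i < 1000; ++i)
        v.push_back(i * 0.5);        // never reallocates while size <= capacity

    double a = v[10];                // unchecked access, as fast as a plain array
    double b = v.at(10);             // checked access, costs an extra comparison
    assert(a == b);

    try {
        v.at(5000);                  // out of range: throws instead of undefined behavior
    } catch (const std::out_of_range&) {
        std::cout << "caught out-of-range access\n";
    }

    return 0;
}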
People who invented the STL, and then made it into the C++ standard library, are expletive deleted smart. Don't even let yourself imagine for one little moment that you can outperform them because of your superior knowledge of legacy C arrays. (You would have a chance if you knew some Fortran, though.)
With std::vector, you can allocate all memory in one go, just like with C arrays. You can also allocate incrementally, again just like with C arrays. You can control when each allocation happens, just like with C arrays. Unlike with C arrays, you can also forget about it all and let the system manage the allocations for you, if that's what you want. This is all absolutely necessary, basic functionality. I'm not sure why anyone would assume it is missing.
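For instance, here is a compact sketch of those three allocation styles (the sizes are arbitrary):

#include <vector>

int main()
{
    // 1. Allocate everything in one go, like new double[1000]():
    std::vector<double> a(1000);

    // 2. Allocate capacity up front, then fill incrementally; you control
    //    exactly when the single allocation happens:
    std::vector<double> b;
    b.reserve(1000);
    for (int i = 0; i < 1000; ++i)
        b.push_back(i);

    // 3. Forget about it all and let the vector manage growth by itself:
    std::vector<double> c;
    for (int i = 0; i < 1000; ++i)
        c.push_back(i);   // occasionally reallocates and copies behind the scenes

    return 0;
}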
Having said all that, go with arrays if you find them easier to understand.
I am not really advising you to go for either arrays or vectors, because I think that for your needs neither may be a perfect fit.
You need to be able to organize your data efficiently, so that queries would not need to scan the whole memory range to get the relevant data. So you want to group the points which are more likely to be selected together close to each other.
If your dataset is static, then you can do that sorting offline and make your array nice and tidy, ready to be loaded into memory at application start-up time; either a vector or an array would work (provided you make the reserve call up front for the vector, since the default allocation growth scheme doubles the size of the underlying array whenever it gets full, and you wouldn't want to use up 16 GB of memory for only 9 GB worth of data).
But if your dataset is dynamic, it will be difficult to do efficient inserts into your set with a vector or an array. Recall that each insert into the array shifts all the successor elements by one place. Of course, an index, like the kd-tree you mention, will help by avoiding a full scan of the array, but if the selected points are scattered across the array, the effect on memory and cache will essentially be the same. The shift would also mean that the index needs to be updated.
My solution would be to cut the array into pages (either linked in a list or indexed by an array) and store the data in the pages. That way, it would be possible to group relevant elements together while still retaining the speed of contiguous memory access within a page. The index would then refer to a page and an offset in that page. Pages wouldn't be filled automatically, which leaves room to insert related elements, or makes shifts really cheap operations.
Note that if pages are always full (except for the last one), you still have to shift every single one of them in the case of an insert, while if you allow incomplete pages, you can limit a shift to a single page, and if that page is full, insert a new page right after it to hold the supplementary element.
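Here is a bare-bones sketch of that page idea; the names, the split policy, and the page capacity are mine, chosen purely for illustration:

#include <cstddef>
#include <vector>

struct Point3D { float x, y, z; };

// One page: a fixed-capacity chunk of contiguous points, deliberately
// allowed to stay partially full so local inserts only shift one page.
struct Page
{
    static constexpr std::size_t kCapacity = 1024;  // ideally tuned to the OS page size
    std::vector<Point3D> points;                    // kept at <= kCapacity elements

    Page() { points.reserve(kCapacity); }
    bool full() const { return points.size() == kCapacity; }
};

struct PagedSet
{
    std::vector<Page> pages;

    // Insert into page p at offset i; split off a new page when p is full.
    void insert(std::size_t p, std::size_t i, const Point3D& pt)
    {
        if (pages[p].full()) {
            // Split: move the upper half of page p into a fresh page after it.
            pages.insert(pages.begin() + p + 1, Page());
            std::vector<Point3D>& src = pages[p].points;
            std::vector<Point3D>& dst = pages[p + 1].points;
            dst.assign(src.begin() + src.size() / 2, src.end());
            src.erase(src.begin() + src.size() / 2, src.end());
            if (i > src.size()) { i -= src.size(); ++p; }
        }
        std::vector<Point3D>& pts = pages[p].points;
        pts.insert(pts.begin() + i, pt);   // the shift is bounded by one page
    }
};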
Some things to keep in mind:
array and vector allocations have upper limits, which are OS-dependent (and the two limits might differ)
On my 32-bit system, the maximum allowed allocation for a vector of 3D points is at around 180 million entries, so for larger datasets one would have to find a different solution. Granted, on a 64-bit OS, that amount might be significantly larger (on Windows 32-bit, the maximum memory space for a process is 2 GB; I think they added some tricks in more advanced versions of the OS to extend that amount). Admittedly, memory will be even more problematic for solutions like mine.
resizing a vector requires allocating a block of the new size on the heap and copying the elements from the old memory chunk to the new one.
So to add just one element to the sequence, you will momentarily need twice the memory during the resize. This issue may not come up with plain arrays, which can be reallocated using the ad hoc OS memory functions (realloc on unices, for instance, though as far as I know that function doesn't guarantee that the same memory chunk will be reused). The problem could be avoided with a vector as well if a custom allocator using those same functions is used.
C++ doesn't make any assumptions about the underlying memory architecture.
Vectors and arrays are meant to represent contiguous memory chunks provided by an allocator, and to wrap that memory chunk with an interface for accessing it. But C++ doesn't know how the OS is managing that memory. In most modern OSes, that memory is actually cut into pages, which are mapped in and out of physical memory. So my solution is, in a way, to reproduce that mechanism at the process level. In order to make the paging efficient, it is necessary to have our page fit the OS page, so a bit of OS-dependent code will be necessary. On the other hand, this is not a concern at all for a vector- or array-based solution.
So in essence, my answer is concerned with the efficiency of updating the dataset in a manner which favors clustering points close to each other. It assumes that such clustering is possible. If that is not the case, then just pushing a new point to the end of the dataset would be perfectly alright.
Although I do not know the exact implementation of std::vector, most contiguous containers like this are slower than raw arrays when growing, because they allocate new memory when they are resized, normally doubling the current capacity, although this is not always the case.
So if the vector contains 16 items and you add another, it needs memory for another 16 items. As vectors are contiguous in memory, this means that it will allocate a solid block of memory for 32 items and update the vector. You can get some performance improvement by constructing the std::vector with an initial capacity that is roughly the size you think your data set will be, although this isn't always an easy number to arrive at.
For operations that are common between vectors and arrays (hence not push_back or pop_back, since arrays are fixed in size), they perform exactly the same, because, by specification, they are the same.
vector access methods are so trivial that even the simpler compiler optimizations will wipe them out.
If you know the size of a vector in advance, just construct it by specifying the size, or call resize, and you will get the same thing you would get with a new[].
If you don't know the size, but you know how much you will need to grow, just call reserve, and you pay no penalty on push_back, since all the required memory is already allocated.
In any case, reallocations are not so "dumb": the capacity and the size of a vector are two distinct things, and the capacity is typically doubled upon exhaustion, so that reallocations of large amounts become less and less frequent.
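A quick way to see the size/capacity distinction and the growth pattern (the exact factor is implementation-defined; common implementations use 2x or 1.5x):

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v;
    std::size_t last_capacity = 0;

    for (int i = 0; i < 1000; ++i) {
        v.push_back(i);
        if (v.capacity() != last_capacity) {   // report only when a reallocation happened
            last_capacity = v.capacity();
            std::cout << "size " << v.size()
                      << " -> capacity " << last_capacity << '\n';
        }
    }

    return 0;
}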
Also, in case you know everything about the sizes and you need no dynamic memory but want the same vector-like interface, consider std::array as well.
It sounds like you need gigs of RAM so you're not paging. I tend to go along with @Philipp's answer, because you really, really want to make sure it's not re-allocating under the hood,
but
what's this about a tree that needs rebalancing?
and you're even thinking about compiler optimization?
Please take a crash course in how to optimize software.
I'm sure you know all about Big-O, but I bet you're used to ignoring the constant factors, right? They might be out of whack by 2 to 3 orders of magnitude, doing things you never would have thought costly.
If that translates to days of compute time, maybe it'll get interesting.
And no compiler optimizer can fix these things for you.
If you're academically inclined, this post goes into more detail.
Is there a way to do this in C++ without having things crash at runtime?
Right now I am declaring
vector<vector<int> > myvec(veclength);
How can I crank up veclength as high as it will go (properly)? Even at 10^7 it crashes, when I should have more than enough computer memory.
This should take approximately 250 MiB of space^1 (or less, depending on the architecture), so memory definitely isn't the problem here, and neither is max_size, which would be on the order of 10^17 (≈ 2^64 / (8+8+8)).
I should mention that I corroborated these calculations by looking at the implementations of std::vector in GCC's libstdc++ and LLVM's libc++, and by testing on a live system. The calculated values correspond 1:1 to the real implementations, and the OP's code works flawlessly with veclength = 10^7.
I therefore conclude that the real cause is elsewhere.
^1) Calculated by approximating the size of each individual vector by three 64-bit integers, denoting the begin pointer, the size, and the capacity respectively, and assuming that an empty vector has a default capacity of 0. Actual implementations may differ, but probably not by much.
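A quick way to check these numbers against your own implementation (the output is implementation-dependent):

#include <iostream>
#include <vector>

int main()
{
    // Typically 24 bytes on a 64-bit system: begin, end and end-of-capacity pointers.
    std::cout << "sizeof(std::vector<int>) = " << sizeof(std::vector<int>) << '\n';

    const unsigned long long n = 10000000ULL;  // the asker's 10^7
    std::cout << "outer vector payload ~ "
              << n * sizeof(std::vector<int>) / (1024.0 * 1024.0) << " MiB\n";

    std::cout << "max_size = "
              << std::vector<std::vector<int> >().max_size() << '\n';

    return 0;
}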
Based on my comment above, I think I might have a solution for you.
See max_size and test what the max_size is on your machine. The key here is that the vector size limit is reduced as the size of the vector's elements increases; since you have a vector of vectors of int, the size of the outer vector is likely to be quite limited.
Here is the result one person got by running the above program. Note that a vector of int (size 4) has a max size of 1073741823, which is about 10^9. You're using a vector<vector<int>>, which takes up significantly more space per element and thus significantly reduces the max size.
Max elements that can be inserted into a vector having elements of size '1' is: 4294967295
Max elements that can be inserted into a vector having elements of size '4' is: 1073741823
Max elements that can be inserted into a vector having elements of size '8' is: 536870911
If you change your data structure to be a vector of pointers to vector<int>, you'll be able to store many more. I know this may fundamentally change a lot of your corresponding functions and structures, but that's the limitation of vector.
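For example, here is a sketch of that pointer variant, using std::unique_ptr so that ownership stays automatic:

#include <memory>
#include <vector>

int main()
{
    // Each element is now a single pointer (8 bytes on a typical 64-bit system)
    // instead of a whole std::vector<int> (typically 24 bytes), so the outer
    // vector's max_size is correspondingly larger.
    std::vector<std::unique_ptr<std::vector<int> > > myvec(10000000);

    // Inner vectors are allocated lazily, only where they are actually needed:
    myvec[0].reset(new std::vector<int>(100, 42));

    return 0;
}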
To whom it may concern: http://codepad.org/nAoPi7cV
#include <iostream>
#include <vector>

int main()
{
    std::cout << "Max elements that can be inserted into a vector having elements of size '"
              << sizeof( std::vector<int> ) << "' is: "
              << std::vector<std::vector<int> >().max_size() << std::endl;
}
Max elements that can be inserted into a vector having elements of size '4' is: 1073741823
Max elements that can be inserted into a vector having elements of size '28' is: 153391689
log10(153391689) ≈ 8.2
So its max_size is big enough to hold 10^7 on the codepad compiling machine. On lesser machines, it may not be.
Also note that even though this max size is reported, the program segfaults on construction: http://codepad.org/agKMMEjQ
int main()
{
    std::vector<std::vector<int> > myvec(153391689);
}
Segmentation fault
If you reduce this further to the size proposed by the asker (10^7), the program crashes yet again: http://codepad.org/zMG0VCeg
std::vector<std::vector<int> > myvec(10000000);
std::bad_alloc: St9bad_alloc
Aborted.
If you reduce your attempted size further though, the program runs happily: http://codepad.org/sbMPppgx
std::vector<std::vector<int> > myvec(100000);
std::cout << myvec.size();
100000
The above program is extremely simple and encounters the exact problems the asker specified; therefore the problem does not lie elsewhere: the limitation comes from the std::vector class itself. Codepad links have been left in for easy reference, but I got the same numbers when testing in a local environment.