Why does Eigen limit matrix size on the stack? - c++

I have recently found that Eigen limits the size of static matrices with EIGEN_STACK_ALLOCATION_LIMIT (to 128kB).
What are the reasons for this limit?

A few reasons come to mind:
Eigen is often used in multithreaded code. Threads other than the main thread usually have a fixed stack size; 2-8 MiB are common limits. On top of that, some Eigen, BLAS, and LAPACK routines use alloca internally. Combined, that doesn't leave much room before the application crashes with such large matrices.
There is very little benefit. At that size, the allocation cost is dwarfed by whatever you do with the matrix.
There are potential hidden costs. Just like with std::array, constant-time move construction and move assignment are not possible. Consider this simple example:
using Matrix = Eigen::Matrix<double, 128, 128>;

Matrix fill();

void foo()
{
    Matrix x;
    x = fill();
}
You might think that you assign directly to x and, thanks to copy elision, there is no extra cost. But in reality, the compiler allocates stack space for a temporary matrix, fill() stores its result there, and that result is then copied into x. Copy elision cannot apply in such a case because the return value needs to be alias-free.
With a dynamic matrix, we would simply swap some pointers and be done with it.
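For contrast, here is a minimal sketch of the dynamic-size version (assuming Eigen 3.3 or later, where dynamic-size matrices support move semantics): the assignment hands over the heap buffer instead of copying 128x128 elements.

#include <Eigen/Core>

using DynMatrix = Eigen::MatrixXd;

DynMatrix fill_dyn()
{
    DynMatrix m(128, 128);
    // ... fill m ...
    return m; // eligible for NRVO or a cheap move
}

void foo_dyn()
{
    DynMatrix x;
    x = fill_dyn(); // move assignment: pointer handover, no element copy
}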

Related

How to create a runtime variable-size array efficiently in C++?

Our library has a lot of chained functions that are called thousands of times per time step when solving an engineering problem on a mesh in a simulation. In these functions, we must create arrays whose sizes are only known at runtime, depending on the application. Below are the three choices we have tried so far:
void compute_something( const int& n )
{
    double fields1[n];               // Option 1.
    auto *fields2 = new double[n];   // Option 2.
    std::vector<double> fields3(n);  // Option 3.

    // .... a lot more operations on the field variables ....
}
Of these choices, Option 1 has worked with our current compiler, but we know it's not safe because we may overflow the stack (plus, it's non-standard). Options 2 and 3 are safer, but using them as frequently as we do impacts the performance of our applications, to the point that the code runs ~6 times slower than with Option 1.
What are other options to handle memory allocation efficiently for dynamic-sized arrays in C++? We have considered constraining the parameter n, so that we can provide the compiler with an upper bound on the array size (and optimization would follow); however, in some functions, n can be pretty much arbitrary and it's hard to come up with a precise upper bound. Is there a way to circumvent the overhead of dynamic memory allocation? Any advice would be greatly appreciated.
Create a cache at startup and pre-allocate it to a reasonable size.
Pass the cache to your compute function, or make it part of your class if compute() is a method.
Resize the cache as needed:
std::vector<double> fields;
fields.reserve( reasonable_size );
...

void compute( int n, std::vector<double>& fields ) {
    fields.resize(n);
    // .... a lot more operations on the field variables ....
}
This has a few benefits.
First, most of the time the vector is resized without any allocation taking place: std::vector only allocates when the requested size exceeds its current capacity, and its exponential growth policy makes that increasingly rare.
Second, you will be reusing the same memory, so it is likely to stay in cache.
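As a usage sketch (the names reasonable_size, num_steps, and size_for_step are placeholders, not part of the question), the simulation loop then reuses one buffer across all calls:

std::vector<double> fields;
fields.reserve( reasonable_size ); // hypothetical upper estimate

for ( int step = 0; step < num_steps; ++step ) {
    const int n = size_for_step( step ); // hypothetical; n varies per call
    compute( n, fields );                // reallocates only if n > capacity
}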

C++ Eigen Matrix Operations vs. Memory Allocation Performance

I have an algorithm that requires the construction of an NxN matrix inside a function that will return the product of this matrix with an Nx1 vector that's also built on the fly.
(N is usually 8 or 9, but must be generalized for values greater than that).
I'm using the Eigen library for performing algebraic operations that are even more complex (least squares and several other constrained problems), so switching it isn't an option.
I've benchmarked the functions, and there's a huge bottleneck due to the intensive memory allocations. I aim to build a thread-safe application, so for some cases I replaced these matrices and vectors with references to elements from a global vector that serves as a provider for objects that cannot be stored on the stack. This avoids calling the constructors/destructors of the Eigen matrices and vectors, but it's not an elegant solution and it can lead to huge problems if considerable care is not taken.
As such, does Eigen offer a workaround? I don't see the option of passing an allocator as a template argument for these objects. Or is there a more obvious thing to do?
You can manage your own memory in a way that fits your needs and use Eigen::Map instead of Eigen::Matrix to perform calculations with it. Just make sure the data is aligned properly or notify Eigen if it isn't.
See the Eigen::Map documentation for details.
Here is a short example:
#include <iostream>
#include <Eigen/Core>

int main() {
    int mydata[3 * 4]; // Manage your own memory as you see fit
    int* data_ptr = mydata;
    Eigen::Map<Eigen::MatrixXi, Eigen::Unaligned> mymatrix(data_ptr, 3, 4);

    // use mymatrix like you would any other matrix
    mymatrix = Eigen::MatrixXi::Zero(3, 4);
    std::cout << mymatrix << '\n';

    // This line will trigger a failed assertion in debug mode.
    // To change that behaviour, see
    // http://eigen.tuxfamily.org/dox-devel/TopicAssertions.html
    mymatrix = Eigen::MatrixXi::Ones(3, 6);
    std::cout << mymatrix << '\n';
}
To gather my comments into a full idea, here is how I would try to do it.
Memory allocation in Eigen is pretty advanced stuff, IMO, and the library does not expose many places to tap into it. The best bet is to wrap the Eigen objects themselves in some kind of resource manager, like the OP did.
I would make it a simple bin that holds Matrix<Scalar, Dynamic, Dynamic> objects. That way you template on the Scalar type and have one manager for matrices of any size.
Whenever you ask for an object, you check whether you have a free object of the desired size and return a reference to it; if not, you allocate a new one. Simple. When you want to release the object, you mark it free in the resource manager. I don't think anything more complicated is needed, though of course more sophisticated logic is possible.
To ensure thread safety, I would put a lock in the manager, initialized in the constructor if needed. Locking on free and allocate would of course be required.
However, it depends on the work schedule: if the threads work on their own arrays, I would consider making one resource-manager instance per thread so they don't block each other. The thing is, a single global lock or global manager could become a bottleneck if you have, say, 12 cores hammering allocations/deallocations, effectively serializing your app through that one lock. A minimal sketch of such a manager follows below.
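Here is a rough sketch of that idea (all names are made up; a std::deque is used so that references handed out stay valid as the pool grows):

#include <Eigen/Core>
#include <deque>
#include <mutex>

// Hypothetical pool of dynamic-size Eigen matrices, as described above.
template <typename Scalar>
class MatrixPool {
public:
    using Mat = Eigen::Matrix<Scalar, Eigen::Dynamic, Eigen::Dynamic>;

    // Hand out a matrix of the requested size, reusing a free one if available.
    Mat& acquire(int rows, int cols) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (Entry& e : entries_) {
            if (e.free && e.mat.rows() == rows && e.mat.cols() == cols) {
                e.free = false;
                return e.mat;
            }
        }
        entries_.push_back(Entry{Mat(rows, cols), false});
        return entries_.back().mat;
    }

    // Mark a matrix as free again; its memory is kept for reuse.
    void release(Mat& m) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (Entry& e : entries_)
            if (&e.mat == &m) { e.free = true; return; }
    }

private:
    struct Entry { Mat mat; bool free; };
    std::deque<Entry> entries_; // references stay valid on push_back
    std::mutex mutex_;
};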
You can try replacing your default memory allocator with jemalloc or tcmalloc. It's pretty easy to try out thanks to the LD_PRELOAD mechanism.
https://github.com/jemalloc/jemalloc/wiki/Getting-Started
http://goog-perftools.sourceforge.net/doc/tcmalloc.html
C++ memory allocation mechanism performance comparison (tcmalloc vs. jemalloc)
what're the differences between tcmalloc/jemalloc and memory pool
I think it works for most C++ projects as well.
You could allocate memory for some common matrix sizes before calling that function with operator new or operator new[], store the void* pointers somewhere, and let the function itself retrieve a memory block of the right size. After that, you can use placement new for matrix construction. Details are given in More Effective C++, Item 8. A sketch of the pattern follows.
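A minimal sketch of that pattern, assuming C++17 for the aligned form of operator new (fixed-size vectorizable Eigen types can require stricter-than-default alignment):

#include <Eigen/Core>
#include <new>

using Mat8 = Eigen::Matrix<double, 8, 8>;

int main() {
    // Pre-allocate one raw block (in practice: several, keyed by size),
    // honouring the matrix type's alignment requirement.
    void* block = ::operator new(sizeof(Mat8), std::align_val_t{alignof(Mat8)});

    Mat8* m = new (block) Mat8; // placement new: construct without allocating
    m->setZero();
    // ... use *m ...
    m->~Mat8();                 // destroy explicitly; the block can be reused

    ::operator delete(block, std::align_val_t{alignof(Mat8)});
}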

Is returning objects slower than passing them by reference and modifying them there?

Suppose I have a function that produces a big structure (in this case, a huge std::vector), and a loop that calls it repeatedly:
std::vector<int> render(int w, int h, int time){
    std::vector<int> result;
    /* heavyweight drawing procedures */
    return result;
}

while(loop){
    std::vector<int> image = render(800,600,time);
    /* send image to graphics card */
    /* ... */
}
My question is: in cases like this, is GCC/Clang smart enough to avoid allocating memory for that huge 800x600x4 array on every iteration? In other words, does this code perform similarly to:
void render(int w, int h, int time, std::vector<int>& image){ /*...*/ }

std::vector<int> image;
while(loop){
    render(800,600,time,image);
    /*...*/
}
Why the question: I'm writing a compiler from a language to C++, and I have to decide which way to go: compile it like the first example or like the last one. The first would be trivial; the last would need some tricky coding, but it could be worth it if it is considerably faster.
Returning all but the most trivial of objects by value will be slower most of the time. If the copy is not elided, the work to copy-construct the entire std::vector<int>, whose length is unbounded, is substantial, and it means a lot of extra memory traffic if, say, your vector ends up with 1,000,000 elements in it. In your first example, the image vector is also constructed and destructed on each pass through the loop. You can always compile your code with the -pg option to turn on gprof profiling and check your results.
The biggest problem is not the allocation of memory; it's the copying of the entire vector that can happen at return. So the second option is much better. In your second example you are also reusing the same vector, which will not allocate memory on each iteration (unless you do image.swap(smth) at some point).
The compiler can help with copy elision, but that is not the major issue here. You could also eliminate that copy explicitly by inlining the function (you may read about rvalue references and move semantics for additional info).
The actual problem might not be solved by the compiler. Even though just one vector instance exists at a time, there would still be the overhead of properly allocating and deallocating the heap memory of that temporary vector on construction and destruction. How this performs then depends solely on the underlying allocator implementation (std::allocator, new, malloc(), ...) of the standard library. The allocator could be smart and keep that memory around for quick reuse, but maybe it is not (apart from the fact that you could replace the vector's allocator with a custom, smart one). Furthermore, it also depends on the allocated memory size, the available physical memory, and the OS. Large blocks (relative to total memory) would be returned to the system sooner. Linux can also overcommit (grant more memory than is actually available), but since the vector implementation, or your renderer respectively, initializes (uses) all the memory by default, that is of no use here.
--> Go for option 2.
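A minimal sketch of that reuse pattern, using the loop and sizes from the question (clear() keeps the capacity, so after the first iteration no further allocation occurs):

std::vector<int> image;
image.reserve(800 * 600); // one allocation up front

while(loop){
    image.clear();                  // size -> 0, capacity unchanged
    render(800, 600, time, image);  // fills the reused buffer
    /* send image to graphics card */
}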

Shortening the execution time of two-dimensional dynamic array algorithms in C++

I defined two two-dimensional dynamic arrays and allocated memory for them. The arrays have the same dimensions (256*256):
double **I1, **I2;
int M = 256;
int N = 256;
int i, j;

I1 = new double *[M + 1];
for (i = 1; i <= M; i++)
    I1[i] = new double[N + 1];

I2 = new double *[M + 1];
for (i = 1; i <= M; i++)
    I2[i] = new double[N + 1];
Then I assigned values to the elements of the arrays. I have to execute mathematical algorithms on these arrays, and I used a lot of for loops, so my code ran very, very slowly.
For example, if I subtract I2 from I1 and assign the difference to another two-dimensional array I3, I use this code:
double **I3;
double temp;

// allocate I3
I3 = new double *[M + 1];
for (i = 1; i <= M; i++)
    I3[i] = new double[N + 1];

// I3 = I1 - I2
for (i = 1; i <= M; i++) {
    for (j = 1; j <= N; j++) {
        temp = I1[i][j] - I2[i][j];
        I3[i][j] = temp;
    }
}
How can I shorten the execution time of this C++ code without using for loops?
Could you suggest other methods, please?
Best regards.
First of all, in most cases I would advise against manually managing your memory like this. I'm sure you have heard that C++ offers container classes to which "algorithms" can be applied. These containers are less error-prone (especially in the presence of exceptions), the operations are more expressive, optimized, and usually well-tested, so proven to work.
In your case, with the size of the array known beforehand, a std::vector can be used with no performance loss (except at creation), since the memory is guaranteed to be contiguous and can thus be used like an array. You should also think about flattening your array: calling an allocation routine in a loop is not exactly speedy, since allocation is costly. When doing matrix multiplication, consider allocating in row-major / column-major pairs; this helps caching... but I digress.
This is only general advice, though; I am not telling you to re-implement this using containers, I just felt the need to mention them.
In this specific case, since you mentioned you want to "execute mathematical algorithms", I would suggest you have a look at a numeric library that can do matrix / vector operations, as this seems to be what you are after.
For C++, there is Newmat, for example, and the (more or less) canonical BLAS/LAPACK implementations (e.g. Netlib, AMD's ACML, ATLAS). These let you perform common (and not so common) operations like adding/subtracting vectors or multiplying matrices much faster, both through optimized algorithms and through SIMD instructions your processor may offer (e.g. SSE).
Obviously, there is no way to avoid iterating over these values when doing computations, but you can do it in an optimized manner and with a standard interface.
In order of importance:
Switch on compiler optimization.
Allocate a single array for each matrix and use something like M*i+j for indexing (see the sketch after this list). This will allocate faster and, perhaps more importantly, be more compact and less fragmented than multiple allocations.
Get used to indexing starting at zero: this saves you one array element per dimension, and comparisons against zero can be slightly faster.
I see nothing wrong in using for loops.
If you are willing to spend even more effort, you could either use a vectorized third-party linear algebra library or vectorize yourself using SSE* intrinsics or GPUs.
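A minimal sketch of points 2 and 3 combined, applied to the subtraction from the question:

#include <vector>

int main() {
    const int M = 256, N = 256;
    std::vector<double> I1(M * N), I2(M * N), I3(M * N);

    // ... fill I1 and I2 ...

    // I3 = I1 - I2, with element (i, j) stored at index i * N + j
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            I3[i * N + j] = I1[i * N + j] - I2[i * N + j];
}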
Some architectures have hardware support for vector arithmetic, such that a single instruction can operate on several elements of an array of doubles at once.
However, the first thing you must do to speed up a program is measure it. Have you timed your program to see where the slowdown occurs?
For example, one thing you appear to be doing in a for loop is lots of heap allocation, which tends to be slow. You could combine all your arrays into one array for greater speed.
You are currently doing the logical equivalent of this:
I3 = I1 - I2;
If you did this:
I1 -= I2;
Now I1 would be storing the result. This would destroy the original value of I1, but would avoid allocating a new array-of-arrays.
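For illustration, the in-place version on the question's own arrays would be:

// I1 -= I2: reuse I1's storage instead of allocating I3
for (i = 1; i <= M; i++)
    for (j = 1; j <= N; j++)
        I1[i][j] -= I2[i][j];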
Also, the intention of C++ is that you define classes to represent a data type and the operations on it. So you could write a class to represent your dynamic array storage, or use an existing one; check out the uBLAS library.
I don't understand why you say this is very slow. You're doing 256*256 subtractions here, and I don't think there is a way to avoid the for loops (even a matrix library would probably still do the same internally).
You might consider allocating all 256*256 doubles in one go instead of calling new 256 times (and then using some index arithmetic, since you only have one index), but then it's probably easier to find a matrix library that does this for you.
Everything you need is already in the standard library: use valarray, as sketched below.
See also: How can I use a std::valarray to store/manipulate a contiguous 2D array?
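A minimal sketch of the question's subtraction with std::valarray (flattened to one dimension):

#include <cstddef>
#include <valarray>

int main() {
    const std::size_t M = 256, N = 256;
    std::valarray<double> I1(M * N), I2(M * N);

    // ... fill I1 and I2 ...

    std::valarray<double> I3 = I1 - I2; // whole-array subtraction, no explicit loop

    // element (i, j) lives at index i * N + j
}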

Optimising C++ 2-D arrays

I need a way to represent a 2-D array (a dense matrix) of doubles in C++, with absolute minimum accessing overhead.
I've done some timing on various linux/unix machines and gcc versions. An STL vector of vectors, declared as:
vector<vector<double> > matrix(n,vector<double>(n));
and accessed through matrix[i][j] is between 5% and 100% slower to access than an array declared as:
double *matrix = new double[n*n];
accessed through an inlined index function matrix[index(i,j)], where index(i,j) evaluates to i+n*j. Other ways of arranging a 2-D array without STL - an array of n pointers to the start of each row, or defining the whole thing on the stack as a constant size matrix[n][n] - run at almost exactly the same speed as the index function method.
Recent GCC versions (> 4.0) seem to be able to compile the STL vector-of-vectors to nearly the same efficiency as the non-STL code when optimisations are turned on, but this is somewhat machine-dependent.
I'd like to use STL if possible, but will have to choose the fastest solution. Does anyone have any experience in optimising STL with GCC?
If you're using GCC, the compiler can analyze your matrix accesses and change the memory layout in certain cases. The magic compiler flag is documented as:
-fipa-matrix-reorg
Perform matrix flattening and transposing. Matrix flattening tries to replace an m-dimensional matrix with its equivalent n-dimensional matrix, where n < m. This reduces the level of indirection needed for accessing the elements of the matrix. The second optimization is matrix transposing, which attempts to change the order of the matrix's dimensions in order to improve cache locality. Both optimizations need the -fwhole-program flag. Transposing is enabled only if profiling information is available.
Note that this option is not enabled by -O2 or -O3. You have to pass it yourself.
My guess is that the fastest approach, for a matrix, is to use a 1-D STL array and override operator() to use it as a 2-D matrix.
However, the standard library also defines a type specifically for non-resizeable numerical arrays: valarray, which comes with various optimisations for in-place operations.
valarray accepts a numerical type as its template argument:
valarray<double> a;
Then you can use slices, indirect arrays, ... and of course you can wrap the valarray and define your own operator()(int i, int j) for 2-D arrays, as sketched below:
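A minimal sketch of such a wrapper (the name Matrix2D is made up):

#include <cstddef>
#include <valarray>

// 2-D view over a 1-D valarray, indexed as m(i, j)
class Matrix2D {
public:
    Matrix2D(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols) {}

    double& operator()(std::size_t i, std::size_t j)       { return data_[i * cols_ + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

private:
    std::size_t cols_;
    std::valarray<double> data_;
};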
Very likely this is a locality-of-reference issue. vector uses new to allocate its internal array, so each row will be at least a little apart in memory due to each block's header; it could be a long distance apart if memory is already fragmented when you allocate them. Different rows of the array are likely to at least incur a cache-line fault and could incur a page fault; if you're really unlucky two adjacent rows could be on memory lines that share a TLB slot and accessing one will evict the other.
In contrast your other solutions guarantee that all the data is adjacent. It could help your performance if you align the structure so it crosses as few cache lines as possible.
vector is designed for resizable arrays. If you don't need to resize the arrays, use a regular C++ array. STL operations can generally operate on C++ arrays.
Do be sure to walk the array in the correct direction, i.e. across consecutive memory addresses rather than jumping between rows. This will reduce cache misses; see the sketch below.
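For instance, with the question's layout, where index(i,j) evaluates to i+n*j, consecutive values of i are adjacent in memory, so the inner loop should vary i:

// The inner loop walks consecutive addresses for the i + n*j layout
double sum = 0.0;
for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i)
        sum += matrix[i + n * j];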
My recommendation would be to use Boost.UBLAS, which provides fast matrix/vector classes.
To be fair, it depends on the algorithms you are using on the matrix.
The double name[n*m] format is very fast when you are accessing data by rows, both because it has almost no overhead besides a multiplication and an addition, and because your rows are packed data that will be coherent in cache.
If your algorithm accesses column-ordered data, then other layouts might have much better cache coherence. If your algorithm accesses data in quadrants of the matrix, still other layouts might be better.
Try to do some research directed at the type of usage and the algorithms you are using. That is especially important if the matrices are very large, since cache misses may hurt your performance far more than needing one or two extra math operations per address.
You could just as easily do vector<double>(n*m);
You may want to look at the Eigen C++ template library at http://eigen.tuxfamily.org/ . It generates AltiVec or SSE2 code to optimize vector/matrix calculations.
There is the uBLAS implementation in Boost. It is worth a look.
http://www.boost.org/doc/libs/1_36_0/libs/numeric/ublas/doc/matrix.htm
Another related library is Blitz++: http://www.oonumerics.org/blitz/docs/blitz.html
Blitz++ is designed to optimize array manipulation.
I did this some time back for raw images by declaring my own two-dimensional array classes.
In a normal 2-D array, you access the elements like array[2][3]. To get that effect, you'd have an array class with an overloaded [] accessor. But this would essentially return another array, thereby giving you the second dimension.
The problem with that approach is the double function-call overhead.
The way I did it was to use the () style of overload instead: rather than array[2][3], I had it use the style array(2,3).
That () function was very tiny, and I made sure it was inlined.
See this link for the general concept of that:
http://www.learncpp.com/cpp-tutorial/99-overloading-the-parenthesis-operator/
You can template the type if you need to.
The difference in my case was that my array was dynamic: I had a block of char memory I'd declare, and I employed a column cache so I knew where in my sequence of bytes the next row began. Access was optimized for neighbouring values, because I was using it for image processing.
It's hard to explain without the code, but essentially the result was as fast as C, and much easier to understand and use.