C++ Eigen Matrix Operations vs. Memory Allocation Performance - c++

I have an algorithm that requires the construction of an NxN matrix inside a function that will return the product of this matrix with an Nx1 vector that's also built on the fly.
(N is usually 8 or 9, but must be generalized for values greater than that).
I'm using the Eigen library for performing algebraic operations that are even more complex (least squares and several other constrained problems), so switching it isn't an option.
I've benchmarked the functions, and there's a huge bottleneck due to the intensive memory allocations. I aim to build a thread-safe application, so, for some cases, I replaced these matrices and vectors with references to elements from a global vector that serves as a provider for objects that cannot be stored on the stack. This avoids calling the constructors/destructors of the Eigen matrices and vectors, but it's not an elegant solution and it can lead to huge problems if considerable care is not taken.
As such, does Eigen offer a workaround (I don't see an option to pass an allocator as a template argument for these objects), or is there a more obvious thing to do?

You can manage your own memory in a way that fits your needs and use Eigen::Map instead of Eigen::Matrix to perform calculations with it. Just make sure the data is aligned properly or notify Eigen if it isn't.
See the Eigen::Map reference for details.
Here is a short example:
#include <iostream>
#include <Eigen/Core>

int main() {
    int mydata[3 * 4]; // Manage your own memory as you see fit
    int* data_ptr = mydata;
    Eigen::Map<Eigen::MatrixXi, Eigen::Unaligned> mymatrix(data_ptr, 3, 4);
    // Use mymatrix like you would any other matrix
    mymatrix = Eigen::MatrixXi::Zero(3, 4);
    std::cout << mymatrix << '\n';
    // This line will trigger a failed assertion in debug mode,
    // because a Map cannot be resized (3x4 vs. 3x6).
    // To change it see
    // http://eigen.tuxfamily.org/dox-devel/TopicAssertions.html
    mymatrix = Eigen::MatrixXi::Ones(3, 6);
    std::cout << mymatrix << '\n';
}

To gather my comments into one complete idea, here is how I would try to do it.
Memory allocation in Eigen is fairly advanced, in my opinion, and the library does not expose many places to tap into it. The best bet is to wrap the Eigen objects themselves in some kind of resource manager, as the OP did.
I would make it a simple bin that holds Matrix<Scalar, Dynamic, Dynamic> objects. This way you template on the Scalar type and have a manager for matrices of arbitrary size.
Whenever you request an object, you check whether a free object of the desired size exists and, if so, return a reference to it. If not, you allocate a new one. When you want to release the object, you mark it as free in the resource manager. I don't think anything more complicated is needed, but of course more sophisticated logic is possible.
To ensure thread safety I would put a lock in the manager, initialized in the constructor if needed. Both allocating and freeing would have to take the lock.
However, this depends on the work schedule. If the threads work on their own arrays, I would consider creating one resource manager instance per thread, so that they don't block each other. A global lock or a global manager could become a point of contention if you have, say, 12 cores working heavily on allocations/deallocations, effectively serializing your app through this one lock.
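The bin described above could be sketched roughly as follows. To stay free of external dependencies, this sketch hands out plain std::vector<double> buffers rather than Eigen matrices (in a real program each buffer could be wrapped with Eigen::Map). The names BufferBin, acquire, release and thread_bin are made up for illustration and are not part of Eigen.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical bin of reusable buffers, keyed by element count.
class BufferBin {
public:
    // Hand out a free buffer of the requested size; allocate only if none is free.
    std::vector<double> acquire(std::size_t n) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto& free_list = free_[n];
        if (free_list.empty()) {
            ++allocations_;
            return std::vector<double>(n);
        }
        std::vector<double> buf = std::move(free_list.back());
        free_list.pop_back();
        return buf;
    }

    // Give a buffer back so a later acquire() of the same size can reuse it.
    void release(std::vector<double> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_[buf.size()].push_back(std::move(buf));
    }

    // Number of buffers that had to be freshly allocated so far.
    std::size_t allocations() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return allocations_;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::size_t, std::vector<std::vector<double>>> free_;
    std::size_t allocations_ = 0;
};

// One bin per thread, so threads never contend on each other's lock.
inline BufferBin& thread_bin() {
    thread_local BufferBin bin;
    return bin;
}
```

With one bin per thread the mutex is never contended; it is kept here so that the same class could also be used as a single shared manager, which is the variant that risks serializing the application.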

You can try replacing your default memory allocator with jemalloc or tcmalloc. It's pretty easy to try out thanks to the LD_PRELOAD mechanism.
https://github.com/jemalloc/jemalloc/wiki/Getting-Started
http://goog-perftools.sourceforge.net/doc/tcmalloc.html
C++ memory allocation mechanism performance comparison (tcmalloc vs. jemalloc)
what're the differences between tcmalloc/jemalloc and memory pool
I think it works for most C++ projects as well.

You could allocate memory for some common matrix sizes before calling that function with operator new or operator new[], store the void* pointers somewhere, and let the function itself retrieve a memory block of the right size. After that, you can use placement new to construct the matrix. Details are given in More Effective C++, Item 8.
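A minimal sketch of that idea, using a hypothetical Mat3 struct as a stand-in for the matrix type and a hypothetical Slot class that owns one pre-allocated block. An Eigen fixed-size type may need stricter alignment than this sketch takes care of, so treat it as an illustration of placement new only.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for a small fixed-size matrix type.
struct Mat3 {
    double v[9];
    explicit Mat3(double fill) { for (double& x : v) x = fill; }
};

// Owns one raw block that was allocated up front with operator new.
class Slot {
public:
    Slot() : raw_(::operator new(sizeof(Mat3))) {}
    ~Slot() { ::operator delete(raw_); }
    Slot(const Slot&) = delete;
    Slot& operator=(const Slot&) = delete;

    // Construct a Mat3 inside the pre-allocated block; no allocation happens here.
    Mat3* construct(double fill) {
        assert(raw_ != nullptr);
        return new (raw_) Mat3(fill);
    }

    // Objects created with placement new must be destroyed explicitly.
    static void destroy(Mat3* m) { m->~Mat3(); }

    void* raw() const { return raw_; }

private:
    void* raw_;
};
```

The hot function would call construct and destroy on a slot it was handed, so the only allocation is the one made when the Slot itself is created.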

Related

Why Eigen limits size on the stack?

I have recently found that Eigen limits the size of static matrices with EIGEN_STACK_ALLOCATION_LIMIT (to 128kB).
What are the reasons for this limit?
A few reasons come to mind:
Eigen is often used in multithreaded code. Threads other than the main thread usually have a fixed stack size. 2-8 MiB are common limits. In addition to that, some Eigen, BLAS and LAPACK routines use alloca internally. All of that combined doesn't leave much room before the application crashes with such large matrices.
There is very little benefit. Allocation cost will be dwarfed by whatever you do with it at such a size.
There are potential hidden costs. Just like std::array, constant time move construction / move assignment are not possible. Consider this simple example:
using Matrix = Eigen::Matrix<double, 128, 128>;
Matrix fill();
void foo()
{
    Matrix x;
    x = fill();
}
You might think that you assign directly to x and that, thanks to copy elision, there is no extra cost. But in reality, the compiler allocates stack space for a temporary matrix. Then fill() stores its result in there. Then that result is copied to x. Copy elision cannot work in such a case because the return value needs to be alias-free.
With a dynamic matrix, we would simply swap some pointers and be done with it.
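The difference can be illustrated without Eigen, using std::array as a stand-in for fixed-size storage and std::vector as a stand-in for dynamic storage. This is only a sketch of the general principle; the helper names are made up.

```cpp
#include <array>
#include <cassert>
#include <utility>
#include <vector>

// Fixed-size storage lives inside the object, so "moving" it copies every element.
using FixedStorage = std::array<double, 16>;
// Dynamic storage lives on the heap, so moving it only transfers the pointer.
using DynamicStorage = std::vector<double>;

// Returns true if the moved-to vector took over the source's heap block.
inline bool move_reuses_heap_block() {
    DynamicStorage src(16, 1.0);
    const double* before = src.data();
    assert(before != nullptr);
    DynamicStorage dst = std::move(src);
    return dst.data() == before;
}

// Returns true if the moved-to array has its own storage and the source is intact.
inline bool move_copies_fixed_storage() {
    FixedStorage src;
    src.fill(1.0);
    FixedStorage dst = std::move(src);
    return dst.data() != src.data() && dst[15] == 1.0 && src[15] == 1.0;
}
```

For 16 doubles the copy is negligible; for a 128x128 matrix (128 KiB of doubles) it is a full copy on every such assignment.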

C++ array of Eigen dynamically sized matrix

In my application, I have a one-dimensional grid and for each grid point there is a matrix (all of the same size and square). For each matrix, a certain update procedure must be performed. At the moment, I define a type
typedef Eigen::Matrix<double, N, N> my_matrix_t;
and allocate the matrices for all grid points using
my_matrix_t *matrices = new my_matrix_t[num_gridpoints];
Now I would like to handle matrices whose sizes are only known at run time (but still square), i.e.,
typedef Eigen::Matrix<double, Dynamic, Dynamic> my_matrix_t;
The allocation procedure remains the same and the code seems to work. However, I assume that the array "matrices" contains only the pointers to each individual matrix's storage, and that overall performance will degrade because the memory has to be collected from random places before the operation on each matrix can be carried out.
Q0: Contiguous Memory?
Is the assumption correct that
new[] will only store the pointers and the matrix data is stored anywhere on the heap?
it is beneficial to have a contiguous memory region for such problems?
Q1: new[] or std::vector?
Using a std::vector was suggested in the comments. Does this make any difference? Advantages/drawbacks of both solutions?
Q2: Overloading new[]?
I think by overloading the operator new[] in the Eigen::Matrix class (or one of its bases) such an allocation could be achieved. Is this a good idea?
Q3: Alternative ways?
As an alternative, I could think of using a large Eigen::Matrix. Can anyone share their experience here? Do you have other suggestions for me?
Let us sum up what we have so far based on the comments to the question and the mailing list post here. I would like to encourage everyone to edit and add things.
Q0: Contiguous memory region.
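Yes, the array only holds the matrix objects, each of which stores a pointer to its coefficient data; the data itself is allocated separately on the heap (regardless of whether you use new[] or std::vector).
Generally, in HPC applications, contiguous memory accesses are beneficial.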
Q1: The basic mechanisms are the same.
However, std::vector is more convenient and takes work off the developer, which also reduces mistakes and memory leaks.
Q2: Use std::vector.
Overloading new[] is not recommended as it is difficult to get it right. For example, alignment issues could lead to errors on different machines. In order to guarantee correct behavior on all machines, use
std::vector<my_matrix_t, Eigen::aligned_allocator<my_matrix_t>> storage;
as explained here.
Q3: Use a large Eigen Matrix for the complete grid.
Alternatively, let the Eigen library do the complete allocation directly by using one of its data structures. This guarantees that issues such as alignment and a contiguous memory region are addressed properly. The matrix
Eigen::Matrix<double, Dynamic, Dynamic> storage(N, num_gridpoints * N);
contains all matrices for the complete grid and can be addressed using
/* i + 1 'th matrix for i in [0, num_gridpoints - 1] */
auto matrix = storage.block(0, i * N, N, N);
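The memory layout behind this approach can be sketched without Eigen. The hypothetical GridStorage class below is a stand-in for the single large matrix: it keeps all n x n matrices back to back in one flat buffer, column-major like Eigen's default, so the i-th matrix starts at offset i * n * n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the single large matrix: one flat, contiguous buffer
// holding num_gridpoints column-major n x n matrices back to back.
class GridStorage {
public:
    GridStorage(std::size_t n, std::size_t num_gridpoints)
        : n_(n), data_(n * n * num_gridpoints, 0.0) {}

    // Pointer to the first coefficient of the i-th matrix (i is zero-based).
    double* matrix(std::size_t i) {
        assert((i + 1) * n_ * n_ <= data_.size());
        return data_.data() + i * n_ * n_;
    }

    // Element (row, col) of the i-th matrix, stored column-major.
    double& at(std::size_t i, std::size_t row, std::size_t col) {
        assert(row < n_ && col < n_);
        return matrix(i)[col * n_ + row];
    }

private:
    std::size_t n_;
    std::vector<double> data_;
};
```

With Eigen itself, the pointer returned by matrix(i) is the kind of address one could hand to an Eigen::Map to get a matrix view without any further allocation.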

Storing big matrices in C++ (Armadillo)

I'm using the Armadillo library in C++ for storing / calculating large matrices. It is my understanding that one should store large arrays / matrices dynamically (on the heap).
Suppose I declare a matrix
mat X;
and set the size to be (say) 500 rows, 500 columns with random entries:
X.randn(500,500);
Does Armadillo store X dynamically (i.e. on the heap) despite not using new or delete? The reason I ask is that Armadillo seems to allow me to declare a variable as:
mat::fixed<n_rows, n_cols>
which, I quote: "is generally faster than dynamic memory allocation, but the size of the matrix can't be changed afterwards (directly or indirectly)".
Regardless of the above -- should I use this:
mat A;
A.set_size(n-1,n-1);
or this:
mat *A = new mat;
(*A).set_size(n-1,n-1);
where n is between 1000 and 100000 and not known in advance.
"Does Armadillo store X dynamically (i.e. on the heap) despite not using new or delete?"
Yes. There will be some form of new or delete in the library code. You just don't notice it from the outside.
"The reason I ask is that Armadillo seems to allow me to declare a variable as (mat::fixed ...)"
You'd have to look into the source code to see what's going on exactly here. My guess is that it has some kind of internal logic that decides how to deal with things based on size. You would normally use mat::fixed for small matrices, though.
Following that, you should use
mat A(n-1,n-1);
if you know the size at that point already. In some cases,
mat A;
A.set_size(n-1,n-1);
might also be okay.
I can't think of a good reason to use your second option with the mat * pointer. First of all, libraries like Armadillo handle their memory allocations internally, and developers take great care to get it right. Also, even if the memory code in the library were broken, your idea of using new mat wouldn't fix it: you would allocate memory for a mat object, but that object is certainly rather small. The big part is probably hidden behind something like a member variable T* data in the class mat, and you cannot influence how this is allocated from the outside.
I initially missed your comment on the size of n. As Mikhail says, dealing with 100000x100000 matrices will require much more care than simply thinking about the way you instantiate them.
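To put that last remark in numbers, here is a small sketch that computes the storage needed for the coefficients alone, assuming a dense matrix of 8-byte doubles (the element type of Armadillo's mat). The function name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

static_assert(sizeof(double) == 8, "this sketch assumes 8-byte doubles");

// Bytes needed for the coefficients of a dense n x n matrix of doubles.
constexpr std::uint64_t dense_matrix_bytes(std::uint64_t n) {
    return n * n * sizeof(double);
}
```

For n = 1000 this gives 8,000,000 bytes (8 MB), which is unproblematic on the heap. For n = 100000 it gives 80,000,000,000 bytes (80 GB) per matrix, which is why the choice between mat A and new mat is not the real issue at that size.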

Should I use manual alloc to allow move semantics?

I'm interested to learn when I should start considering move semantics in favour of copying data, depending on the size of that data and the usage of the class. For example, for a Matrix4 class we have two options:
// Option 1: heap-allocated storage
struct Matrix4{
    float* data;
    Matrix4(){ data = new float[16]; }
    Matrix4(Matrix4&& other){
        *this = std::move(other);
    }
    Matrix4& operator=(Matrix4&& other)
    {
        ... removed for brevity ...
    }
    ~Matrix4(){ delete [] data; }
    ... other operators and class methods ...
};
// Option 2: storage inside the object
struct Matrix4{
    float data[16]; // let the compiler do the magic
    Matrix4(){}
    Matrix4(const Matrix4& other){
        std::copy(other.data, other.data+16, data);
    }
    Matrix4& operator=(const Matrix4& other)
    {
        std::copy(other.data, other.data+16, data);
        return *this;
    }
    ... other operators and class methods ...
};
I believe there is some overhead in having to allocate and deallocate memory "by hand". Given the chances of actually hitting the move constructor when using this class, what is the preferred implementation of a class with such a small memory footprint? Is move really always preferred over copy?
In the first case, allocation and deallocation are expensive - because you are dynamically allocating memory from the heap, even if your matrix is constructed on the stack - and moves are cheap (just copying a pointer).
In the second case, allocation and deallocation are cheap, but moves are expensive - because they are actually copies.
So if you are writing an application and you just care about performance of that application, the answer to the question "Which one is better?" likely depends on how much you are creating/destroying matrices vs how much you are copying/moving them - and in any case, do your own measurements to support any conjectures.
By doing measurements you will also check whether your compiler is doing a lot of copy/move elisions in places where you expect moves to be going on - results may be against your expectations.
Also, cache locality may have an impact here: if you allocate storage for a matrix's data on the heap, having three matrices that you want to process element-by-element created on the stack will likely require quite a scattered memory access pattern - potentially resulting in more cache misses.
On the other hand, if you are using arrays for which memory is allocated on the stack, it is likely that the same cache line will be able to hold the data of all those matrices - thus increasing the cache hit rate. Not to mention the fact that in order to access elements on the heap you first need to read the value of the data pointer, which means accessing a different region of memory than the one holding the elements.
So once more, the moral of the story is: do your own measurements.
If you are writing a library on the other hand, and you cannot predict how many constructions/destructions vs moves/copies the client is going to perform, then you may offer two such matrix classes, and factor out the common behavior into a base class - possibly a base class template.
That will give the client flexibility and will give you a sufficiently high degree of reuse - no need to write the implementation of all common member functions twice.
This way, clients may choose the matrix class that best fits the creation/moving profile of the application in which they are using it.
UPDATE:
As DeadMG points out in the comments, one advantage of the array-based approach over the dynamic allocation approach is that the latter is doing manual resource management through raw pointers, new, and delete, which forces you to write a user-defined destructor, copy constructor, move constructor, copy-assignment operator, and move-assignment operator.
You could avoid all of this if you were using std::vector, which would perform the memory management task for you and would save you from the burden of defining all those special member functions.
That said, merely suggesting std::vector instead of manual memory management - good advice as it is in terms of design and programming practice - does not answer the question, while I believe the original answer does.
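For reference, the std::vector-based variant mentioned in the update could look roughly like this. HeapMatrix4 is a hypothetical name; the point of the sketch is that the class declares no special member functions at all, because std::vector already provides correct copy, move and destruction.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical heap-backed 4x4 matrix that declares no special member functions:
// std::vector supplies copy, move and destruction.
struct HeapMatrix4 {
    std::vector<float> data = std::vector<float>(16, 0.0f);

    // Row-major element access.
    float& operator()(std::size_t row, std::size_t col) {
        assert(row < 4 && col < 4);
        return data[row * 4 + col];
    }
};
```

Copying a HeapMatrix4 duplicates the 16 floats, while moving it only transfers the vector's internal pointer, which is the trade-off discussed in the original answer.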
Like everything else in programming, especially when performance is concerned, it's a complicated trade-off.
Here, you have two designs: to keep the data inside your class (method 1) or to allocate the data on the heap and keep a pointer to it in the class (method 2).
As far as I can tell, these are the trade-offs you are making:
Construction/Destruction Speed: Naively implemented, method 2 will be slower here, because it requires dynamic memory allocation and deallocation. However, you can help the situation using custom memory allocators, especially if the size of your data is predictable and/or fixed.
Size: In your 4x4 matrix example, method 2 requires storing an additional pointer, plus memory allocation size overhead (typically anywhere from 4 to 32 bytes). This might or might not be a factor, but it certainly must be considered, especially if your class instances are small.
Move Speed: Method 2 has a very fast move operation, because it only requires setting two pointers. In method 1, you have no choice but to copy your data. However, while being able to rely on fast moves can make your code pretty, straightforward, readable and more efficient, compilers are quite good at copy elision, which means that you can write your pretty, straightforward and readable pass-by-value interfaces even if you implement method 1, and the compiler will not generate too many copies anyway. But you can't be sure of that, so relying on this compiler optimization, especially if your instances are larger, requires measurement and inspection of the generated code.
Member Access Speed: This is the most important differentiator for small classes, in my opinion. Each time you access an element in a matrix implemented using method 2 (or access a field in a class implemented that way, i.e., with external data) you access the memory twice: once to read the address of the external block of memory, and once to actually read the data you want. In method 1, you just directly access the field or element you want. This means that in method 2, every access could potentially generate an additional cache miss, which could affect your performance. This is especially important if your class instances are small (e.g. a 4x4 matrix) and you operate on many of them stored in arrays or vectors.
In fact, this is why you might want to actually copy bytes around when you are copying/moving an instance of your matrix, instead of just setting a pointer: to keep your data contiguous. This is why flat data structures (like arrays of values) are much preferred in high-performance code over pointer-spaghetti data structures (like arrays of pointers, linked lists, etc.). So, while moving is cooler and faster than copying in isolation, you sometimes want to copy your instances to make (or keep) a whole bunch of them contiguous and make iterating over and accessing them much more efficient.
Flexibility of Length/Size: Method 2 is obviously more flexible in this regard because you can decide how much data you need at runtime, be it 16 or 16777216 bytes.
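The member-access and contiguity point from the list above can be made concrete with a small sketch. InlineMatrix4 and PointerMatrix4 are hypothetical names for method 1 and method 2; the sketch only shows the layout difference and does not measure anything.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Method 1: coefficients stored inside the object.
struct InlineMatrix4 {
    float data[16];
};

// Method 2: the object holds only a pointer; coefficients live in a separate heap block.
struct PointerMatrix4 {
    std::unique_ptr<float[]> data{new float[16]()};
};

// Sum every coefficient of every matrix. With method 1 the vector's elements
// are the coefficients themselves, so this is one linear sweep over memory.
inline float sum_all(const std::vector<InlineMatrix4>& ms) {
    float total = 0.0f;
    for (const InlineMatrix4& m : ms)
        for (float x : m.data) total += x;
    return total;
}
```

A std::vector<PointerMatrix4> would instead store only pointers contiguously, and the same loop would have to follow one pointer per matrix to a block that may be anywhere on the heap.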
All in all, here's the algorithm I suggest you use for picking one implementation:
If you need a variable amount of data, pick method 2.
If you have very large amounts of data in each instance of your class (e.g. several kilobytes,) pick method 2.
If you need to copy instances of your class around a lot (and I mean a lot!) pick method 2 (but try to measure the performance improvement and inspect the generated code, especially in hot areas.)
In all other cases, prefer method 1.
In short, method 1 should be your default, until proven otherwise. And the way to prove anything regarding performance is measurement! So don't optimize anything unless you have measured and have proof that one method is better than another, and also (as mentioned in other answers,) you might want to implement both methods if you are writing a library and let your users choose the implementation.
I would probably use a stdlib container (such as std::vector or std::array) that already implements move semantics, and then I would simply have the vectors or arrays move.
For example, you could use std::array< std::array< float, 4 >, 4 > or std::vector< std::vector< float > > to represent your matrix type.
I don't think it will matter a lot for a 4x4 matrix, but it might for 10000x10000. So yes, a move constructor for a matrix type is definitely worth it, especially if you're planning to work with a lot of temporary matrices (which seems likely when you want to do calculations with them). It will also allow returning Matrix4 objects efficiently (whereas you'd have to use a by-ref call to get around copying otherwise).
Rather unrelated to the matter but probably worth mentioning: in case you decide to use std::array, please make a Matrix a template class (instead of embedding the size into the classname).
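A minimal sketch of such a template, assuming a flat std::array in row-major order rather than the nested arrays mentioned above. The names Matrix and Matrix4f are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size matrix whose dimensions are template parameters
// instead of being baked into the class name.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
    std::array<T, Rows * Cols> data{};  // value-initialized, i.e. all zeros

    // Row-major element access.
    T& operator()(std::size_t row, std::size_t col) {
        assert(row < Rows && col < Cols);
        return data[row * Cols + col];
    }
    const T& operator()(std::size_t row, std::size_t col) const {
        assert(row < Rows && col < Cols);
        return data[row * Cols + col];
    }
};

using Matrix4f = Matrix<float, 4, 4>;
```

Because std::array supplies copy and move, the template needs no user-defined special member functions, and other sizes such as Matrix<double, 2, 3> come for free.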

Using STL Allocator with STL Vectors

Here's the basic problem. There's an API which I depend on, with a method using the following syntax:
void foo_api (std::vector<type>& ref_to_my_populated_vector);
The area of code in question is rather performance intensive, and I want to avoid using the heap to allocate memory. As a result, I created a custom allocator which allocates the memory required for the vector on the stack. So, I can now define a vector as:
// Create the stack allocator, with room for 100 elements
my_stack_allocator<type, 100> my_allocator;
// Create the vector, specifying our stack allocator to use
std::vector<type, my_stack_allocator<type, 100>> my_vec(my_allocator);
This is all fine. Performance tests using the stack allocated vector compared to the standard vector show performance is roughly 4x faster. The problem is, I can't call foo_api! So...
foo_api(my_vec); // Results in an error due to incompatible types.
// Can't convert std::vector<type> to std::vector<type, allocator>
Is there a solution to this?
You have to use the default allocator just as the function expects. You have two different types, and there's no way around that.
Just call reserve prior to operating on the vector to get the memory allocations out of the way.
Think about the bad things that could happen. That function may take your vector and start adding more elements. Soon, you could over-flow the stack space you've allocated; oops!
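The reserve suggestion could look like this. make_filled is a hypothetical helper; the vector uses the default allocator, so it can be passed to a function that expects std::vector<int>&.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reserve once, then fill: the push_back calls below never reallocate,
// because the size never exceeds the reserved capacity.
inline std::vector<int> make_filled(std::size_t count) {
    std::vector<int> v;                      // default allocator
    v.reserve(count);                        // single allocation up front
    const std::size_t capacity = v.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        v.push_back(static_cast<int>(i));
        assert(v.capacity() == capacity);    // storage has not been reallocated
    }
    return v;
}
```

This still performs one heap allocation per vector, so it reduces the cost rather than removing it; if the vector can be kept alive and cleared between calls, even that one allocation is paid only once.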
If you're really concerned about performance, a much better route is to replace operator new and kin with a custom memory manager. I have done so, and allocations can be hugely improved. For me, allocating blocks of size 512 or less takes about 4 operations (moving a couple of pointers around); I used a pool allocator.
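A pool allocator of the kind mentioned here can be sketched as a free list of fixed-size blocks. This is not the answerer's implementation; BlockPool and its members are hypothetical names. Hooking it into operator new is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool: allocate and deallocate only move a pointer.
class BlockPool {
public:
    BlockPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count) {
        assert(block_size > 0);
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < block_count; ++i)
            push(storage_.data() + i * block_size_);
    }
    BlockPool(const BlockPool&) = delete;
    BlockPool& operator=(const BlockPool&) = delete;

    // Pop the head of the free list; returns nullptr when the pool is exhausted.
    void* allocate() {
        if (free_ == nullptr) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    // Push the block back onto the free list.
    void deallocate(void* p) { push(p); }

private:
    struct Node { Node* next; };

    // Round up so that every block is suitably aligned for any fundamental type.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }
    void push(void* p) { free_ = new (p) Node{free_}; }

    std::size_t block_size_;
    std::vector<unsigned char> storage_;
    Node* free_ = nullptr;
};
```

Both operations are a handful of pointer moves, which matches the cost described above. The sketch is not thread-safe; a multithreaded program would need a lock or one pool per thread.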