How to efficiently store matmul results into another matrix in Eigen - c++

I would like to store multiple matmul results as row vector into another matrix, but my current code seems to take a lot of memory space. Here is my pseudo code:
for (int i = 0; i < C_row; ++i) {
C.row(i) = (A.transpose() * B).reshaped(1, C_col);
}
In this case, C is actually a Map of pre-allocated array declared as Map<Matrix<float, -1, -1, RowMajor>> C(C_array, C_row, C_col);.
Therefore, I expect the calculated matmul results can directly go to memory space of C and do not create temporary copies. In other words, the total memory usage should be the same with or without the above code. But I found that with the above code, the memory usage is increased significantly.
I tried to use C.row(i).noalias() to directly assign results to each row of C, but there is no memory usage difference. How to make this code more efficiently by taking less memory space?

The reshaped is the culprit. It cannot be folded into the matrix multiplication so it results in a temporary allocation for the multiplication. Ideally you would need to put it onto the left of the assignment:
C.row(i).reshaped(A.cols(), B.cols()).noalias() = A.transpose() * B;
However, that does not compile. Reshaped doesn't seem to fulfil the required interface. It's a pretty new addition to Eigen, so I'm not overly surprised. You might want to open a feature request on their bug tracker.
Anyway, as a workaround, try this:
Eigen::Map<Eigen::MatrixXd> reshaped(C.row(i).data(), A.cols(), B.cols());
reshaped.noalias() = A.transpose() * B;

Related

Eigen: Efficiently storing the output of a matrix evaluation in a raw pointer

I am using some legacy C code that passing around lots of raw pointers. To interface with the code, I have to pass a function of the form:
const int N = ...;
T * func(T * x) {
// TODO Put N elements in x
return x + N;
}
where this function should write the result into x, and then return x.
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
// Do a lot of operations that result in some matrices like
Eigen::Matrix<T, N, 1 > A = ...
Eigen::Matrix<T, N, 1 > B = ...
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind, and then view the results with KCachegrind, the lines
constraint = A - B;
are almost always the bottleneck. This is sort of understandable, because such lines could/are potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for(int i=0; i<N; ++i)
x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling, and explicit vectorization (both depends on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler) and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies on this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enabled compiler optimizations flags (like -O3 with gcc/clang).

data locality for implementing 2d array in c/c++

Long time ago, inspired by "Numerical recipes in C", I started to use the following construct for storing matrices (2D-arrays).
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
for (i = 0; i < NumRows; ++i) x[i] = (double *)calloc(NumCol, sizeof(double));
return x;
}
double **x = allocate_matrix(1000,2000);
x[m][n] = ...;
But recently noticed that many people implement matrices as follows
double *x = (double *)malloc(NumRows * NumCols * sizeof(double));
x[NumCol * m + n] = ...;
From the locality point of view the second method seems perfect, but has awful readability... So I started to wonder, is my first method with storing auxiliary array or **double pointers really bad or the compiler will optimize it eventually such that it will be more or less equivalent in performance to the second method? I am suspicious because I think that in the first method two jumps are made when accessing the value, x[m] and then x[m][n] and there is a chance that each time the CPU will load first the x array and then x[m] array.
p.s. do not worry about extra memory for storing **double, for large matrices it is just a small percentage.
P.P.S. since many people did not understand my question very well, I will try to re-shape it: do I understand right that the first method is kind of locality-hell, when each time x[m][n] is accessed first x array will be loaded into CPU cache and then x[m] array will be loaded thus making each access at the speed of talking to RAM. Or am I wrong and the first method is also OK from data-locality point of view?
For C-style allocations you can actually have the best of both worlds:
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); // <<< single contiguous memory allocation for entire array
for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCols;
return x;
}
This way you get data locality and its associated cache/memory access benefits, and you can treat the array as a double ** or a flattened 2D array (array[i * NumCols + j]) interchangeably. You also have fewer calloc/free calls (2 versus NumRows + 1).
No need to guess whether the compiler will optimize the first method. Just use the second method which you know is fast, and use a wrapper class that implements for example these methods:
double& operator(int x, int y);
double const& operator(int x, int y) const;
... and access your objects like this:
arr(2, 3) = 5;
Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.
Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.
The two methods are quite different.
While the first method allows for easier direct access to the values by adding another indirection (the double** array, hence you need 1+N mallocs), ...
the second method guarantees that ALL values are stored contiguously and only requires one malloc.
I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.
In C++, you'd just implement it like this:
std::vector<double> matrix(NumRows * NumCols);
matrix[y * numCols + x] = value; // Access
and if you're concerned with the inconvenience of having to compute the index yourself, add a wrapper that implements operator(int x, int y) to it.
You are also right that the first method is more expensive when accessing the values. Because you need two memory lookups as you described x[m] and then x[m][n]. There is no way the compiler will "optimize this away". The first array, depending on its size, will be cached, and the performance hit may not be that bad. In the second case, you need an extra multiplication for direct access.
In the first method you use, the double* in the master array point to logical columns (arrays of size NumCol).
So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):
foreach(row in rows):
foreach(elem in row):
//Do something
If you tried the same thing with the second method, and if element access was done the way you specified (i.e. x[NumCol*m + n]), you still get the same benefit. This is because you treat the array to be in row-major order. If you tried the same pseudocode while accessing the elements in column-major order, I assume you'd get cache misses given that the array size is large enough.
In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).
So, in conclusion, the second method should be much better in terms of performance.
If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:
double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));
and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.

Optimizing a quadruple nested "for" loop

I'm developing a 2D numerical model in c++, and I would like to speed up a specific member function that is slowing down my code. The function is required to loop over every i,j grid point in the model and then perform a double summation at every grid point over l and m. The function is as follows:
int Class::Function(void) {
double loadingEta;
int i,j,l,m;
//etaLatLen=64, etaLonLen=2*64
//l_max = 12
for (i=0; i<etaLatLen; i++) {
for (j=0; j < etaLonLen; j++) {
loadingEta = 0.0;
for (l=0; l<l_max+1; l++) {
for (m=0; m<=l; m++) {
loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
}
}
etaNewArray[i][j] = loadingEta;
}
}
return 1;
}
I've been trying to change the loop order to speed things up, but to no avail. Any help would be much appreciated. Thank you!
EDIT 1:
All five arrays are allocated in the constructor of my class as follows:
etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
etaLegendreArray[i] = new double*[l_max+1];
for (int l=0; l<l_max+1; l++) {
etaLegendreArray[i][l] = new double[l_max+1];
}
}
SH_C = new double*[l_max+1];
SH_S = new double*[l_max+1];
for (int i=0; i<l_max+1; i++) {
SH_C[i] = new double[l_max+1];
SH_S[i] = new double[l_max+1];
}
etaCosMLon = new double*[etaLonLen];
etaSinMLon = new double*[etaLonLen];
for (int j=0; j<etaLonLen; j++) {
etaCosMLon[j] = new double[l_max+1];
etaSinMLon[j] = new double[l_max+1];
}
Perhaps it would be better if these were 1D arrays instead of multidimensional?
Hopping off into X-Y territory here. Rather than speeding up the algorithm, let's try and speed up data access.
etaLegendreArray = new double**[etaLatLen];
for (int i=0; i<etaLatLen; i++) {
etaLegendreArray[i] = new double*[l_max+1];
for (int l=0; l<l_max+1; l++) {
etaLegendreArray[i][l] = new double[l_max+1];
}
}
Doesn't create a 3D array of doubles. It creates an array of pointers to arrays of pointers to arrays of doubles. Each array is its own block of memory and who knows where it's going to sit in storage. This results in a data structure that has what is called "poor spacial locality." All of the pieces of the structure may be scattered all over the place. In the 3D array you are hopping to three different places just to find out where your value is.
Because the many blocks of storage required to simulate the 3D array may be nowhere near each other, the CPU may not be able to effectively load the cache (high-speed memory) ahead of time and has to stop the useful work it's doing and wait to access slower storage, probably RAM much more frequently. Here is a good, high-level article on how much this can hurt performance.
On the other hand, if the whole array is in one block of memory, is "contiguous", the CPU can read larger chunks of the memory, maybe all of it, it needs into cache all at once. Plus if the compiler knows the memory the program will use is all in one big block it can perform all sorts of groovy optimizations that will make your program even faster.
So how do we get a 3D array that's all one memory block? If the sizes are static, this is easy
double etaLegendreArray[SIZE1][SIZE2][SIZE3];
This doesn't look to be your case, so what you want to do is allocate a 1D array, because it will be one contiguous block of memory.
double * etaLegendreArray= new double [SIZE1*SIZE2*SIZE3];
and do the array indexing math by hand
etaLegendreArray[(x * SIZE2 + y) * SIZE3 + z] = data;
Looks like that ought to be slower with all the extra math, huh? Turns out the compiler is hiding math that looks a lot like that from you every time you use a []. You lose almost nothing, and certainly not as much as you lose with one unnecessary cache miss.
But it is insane to repeat that math all over the place, sooner or later you will screw up even if the drain on readability doesn't have you wishing for death first, so you really want to wrap the 1D array in a class to helper handle the math for you. And once you do that, you might as well have that class handle the allocation and deallocation so you can take advantage of all that RAII goodness. No more for loops of news and deletes all over the place. It's all wrapped up and tied with a bow.
Here is an example of a 2D Matrix class easily extendable to 3D. that will take care of the basic functionality you probably need in a nice predictable, and cache-friendly manner.
If the CPU supports it and the compiler is optimizing enough, you might get some small gain out of the C99 fma (fused multiply-add) function, to convert some of your two step operations (multiply, then add) to one step operations. It would also improve accuracy, since you only suffer floating point rounding once for fused operation, not once for multiplication and once for addition.
Assuming I'm reading it right, you could change your innermost loop's expression from:
loadingEta += etaLegendreArray[i][l][m] * (SH_C[l][m]*etaCosMLon[j][m] + SH_S[l][m]*etaSinMLon[j][m]);
to (note no use of += now, it's incorporated in fma):
loadingEta = fma(etaLegendreArray[i][l][m], fma(SH_C[l][m], etaCosMLon[j][m], SH_S[l][m]*etaSinMLon[j][m]), loadingEta);
I wouldn't expect anything magical performance-wise, but it might help a little (again, only with optimizations turned up enough for the compiler to inline hardware instructions to do the work; if it's calling a library function, you'll lose any improvements to the function call overhead). And again, it should improve accuracy a bit, by avoiding two rounding steps you were incurring.
Mind you, on some compilers with appropriate compilation flags, they'll convert your original code to hardware FMA instructions for you; if that's an option, I'd go with that, since (as you can see) the fma function tends to reduce code readability.
Your compiler may offer vectorized versions of floating point instructions as well, which might meaningfully improve performance (see previous link on automatic conversion to FMA).
Most other improvements would require more information about the goal, the nature of the input arrays being used, etc. Simple threading might gain you something, OpenMP pragmas might be something to look at as a way to simplify parallelizing the loop(s).

c++: passing Eigen-defined matrices to functions, and using them - best practice

I have a function which requires me to pass a fairly large matrix (which I created using Eigen) - and ranges from dimensions 200x200 -> 1000x1000. The function is more complex than this, but the bare bones of it are:
#include <Eigen/Dense>
int main()
{
MatrixXi mIndices = MatrixXi::Zero(1000,1000);
MatrixXi* pMatrix = &mIndices;
MatrixXi mTest;
for(int i = 0; i < 10000; i++)
{
mTest = pMatrix[0];
// Then do stuff to the copy
}
}
Is the reason that it takes much longer to run with a larger size of matrix because it takes longer to find the available space in RAM for the array when I set it equal to mTest? When I switch to a sparse array, this seems to be quite a lot quicker.
If I need to pass around large matrices, and I want to minimise the incremental effect of matrix size on runtime, then what is best practice here? At the moment, the same program is running slower in c++ than it is in Matlab, and obviously I would like to speed it up!
Best,
Ben
In the code you show, you are copying a 1,000,000 element 10,000 times. The assignment in the loop creates a copy.
Generally if you're passing an Eigen matrix to another function, it can be beneficial to accept the argument by reference.
It's not really clear from your code what you're trying to achieve however.

Performace of large 3D arrays: contiguous 1D storage vs T***

I wonder if anyone could advise on storage of large (say 2000 x 2000 x 2000) 3D arrays for finite difference discretization computations. Does contiguous storage float* give better performance then float*** on modern CPU architectures?
Here is a simplified example of computations, which are done over entire arrays:
for i ...
for j ...
for k ...
u[i][j][k] += v[i][j][k+1] + v[i][j][k-1]
+ v[i][j+1][k] + v[i][j-1][k] + v[i+1][j][k] + v[i-1][j][k];
Vs
u[i * iStride + j * jStride + k] += ...
PS:
Considering size of problems, storing T*** is a very small overhead. Access is not random. Moreover, I do loop blocking to minimize cache misses. I am just wondering how triple dereferencing in T*** case compares to index computation and single dereferencing in case of 1D array.
These are not apples-to-apples comparisons: a flat array is just that - a flat array, which your code partitions into segments according to some logic of linearizing a rectangular 3D array. You access an element of an array with a single dereference, plus a handful of math operations.
float***, on the other hand, lets you keep a "jagged" array of arrays or arrays, so the structure that you can represent inside such an array is a lot more flexible. Naturally, you need to pay for that flexibility with additional CPU cycles required for dereferencing pointers to pointers to pointers, then a pointer to pointer, and finally a pointer (the three pairs of square brackets in the code).
Naturally, access to the individual elements of float*** is going to be a little slower, if you access them in truly random order. However, if the order is not random, the difference that you see may be tiny, because the values of pointers would be cached.
float*** will also require more memory, because you need to allocate two additional levels of pointers.
The short answer is: benchmark it. If the results are inconclusive, it means it doesn't matter. Do what makes your code most readable.
As #dasblinkenlight has pointed out, the structures are nor equivalent because T*** can be jagged.
At the most fundamental level, though, this comes down to the arithmetic and memory access operations.
For your 1D array, as you have already (almost) written, the calculation is:
ptr = u + (i * iStride) + (j * jStride) + k
read *ptr
With T***:
ptr = u + i
x = read ptr
ptr = x + j
y = read ptr
ptr = y + k
read ptr
So you are trading two multiplications for two memory accesses.
In computer go, where people are very performance-sensitive, everyone (AFAIK) uses T[361] rather than T[19][19] (*). This decision is based on benchmarking, both in isolation and the whole program. (It is possible everyone did those benchmarks years and years ago, and have never done them again on the latest hardware, but my hunch is that a single 1-D array will still be better.)
However your array is huge, in comparison. As the code involved in each case is easy to write, I would definitely try it both ways and benchmark.
*: Aside: I think it is actually T[21][21] vs. t[441], in most programs, as an extra row all around is added to speed up board-edge detection.
One issue that has not been mentioned yet is aliasing.
Does your compiler support some type of keyword like restrict to indicate that you have no aliasing? (It's not part of C++11 so would have to be an extension.) If so, performance may be very close to the same. If not, there could be significant differences in some cases. The issue will be with something like:
for (int i = ...) {
for (int j = ...) {
a[j] = b[i];
}
}
Can b[i] be loaded once per outer loop iteration and stored in a register for the entire inner loop? In the general case, only if the arrays don't overlap. How does the compiler know? It needs some type of restrict keyword.