Eigen: Efficiently storing the output of a matrix evaluation in a raw pointer - c++

I am using some legacy C code that passes around lots of raw pointers. To interface with the code, I have to provide a function of the form:
const int N = ...;
T * func(T * x) {
// TODO Put N elements in x
return x + N;
}
where this function should write the result into x, and then return x.
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
// Do a lot of operations that result in some matrices like
Eigen::Matrix<T, N, 1 > A = ...
Eigen::Matrix<T, N, 1 > B = ...
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind and then view the results with KCachegrind, lines like
constraint = A - B;
are almost always the bottleneck. This is somewhat understandable, because such a line is potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission-critical and must run in real time, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern, since we are constantly tweaking the model used in the internal calculations.

These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for(int i=0; i<N; ++i)
x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling and explicit vectorization (both depend on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler) and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies in this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enable compiler optimization flags (like -O3 with gcc/clang).
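For concreteness, here is a complete, compilable version of the idiom above with T fixed to double and N to 5; the contents of A and B are just placeholders. Built with optimizations enabled (e.g. -O3), the Map should vanish exactly as described:
#include <Eigen/Dense>

constexpr int N = 5;
using Vec = Eigen::Matrix<double, N, 1>;

double* func(double* x) {
    // Stand-ins for the real calculations
    Vec A = Vec::Constant(1.0);
    Vec B = Vec::LinSpaced(N, 0.0, 1.0);

    Eigen::Map<Vec> constraint(x);  // zero-cost view over the raw pointer
    constraint = A - B;             // compiles to an element-wise store into x
    return x + N;
}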

Related

How to efficiently store matmul results into another matrix in Eigen

I would like to store multiple matmul results as row vectors in another matrix, but my current code seems to use a lot of memory. Here is my pseudocode:
for (int i = 0; i < C_row; ++i) {
C.row(i) = (A.transpose() * B).reshaped(1, C_col);
}
In this case, C is actually a Map of a pre-allocated array, declared as Map<Matrix<float, -1, -1, RowMajor>> C(C_array, C_row, C_col);.
Therefore, I expect the calculated matmul results to go directly into the memory of C without creating temporary copies. In other words, the total memory usage should be the same with or without the above code. But I found that with the above code, memory usage increases significantly.
I tried to use C.row(i).noalias() to directly assign results to each row of C, but there is no difference in memory usage. How can I make this code more efficient and use less memory?
The reshaped is the culprit: it cannot be folded into the matrix multiplication, so it results in a temporary allocation for the product. Ideally you would put it on the left-hand side of the assignment:
C.row(i).reshaped(A.cols(), B.cols()).noalias() = A.transpose() * B;
However, that does not compile. Reshaped doesn't seem to fulfil the required interface. It's a pretty new addition to Eigen, so I'm not overly surprised. You might want to open a feature request on their bug tracker.
Anyway, as a workaround, try this:
Eigen::Map<Eigen::MatrixXf> reshaped(C.row(i).data(), A.cols(), B.cols());
reshaped.noalias() = A.transpose() * B;
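Putting the workaround into the loop from the question, and assuming A and B are Eigen::MatrixXf with A.cols() * B.cols() == C_col (as implied by reshaped(1, C_col)), a sketch would be:
for (int i = 0; i < C_row; ++i) {
    Eigen::Map<Eigen::MatrixXf> Ci(C.row(i).data(), A.cols(), B.cols());
    Ci.noalias() = A.transpose() * B;  // the product writes straight into C's storage
}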

Dealing with a contiguous vector of fixed-size matrices for both storage layouts in Eigen

An external library gives me a raw pointer of doubles that I want to map to an Eigen type. The raw array is logically a big ordered collection of small dense fixed-size matrices, all of the same size. The main issue is that the small dense matrices may be in row-major or column-major ordering and I want to accommodate them both.
My current approach is as follows. Note that all the entries of a small fixed-size block (in the array of blocks) need to be contiguous in memory.
template<int bs, class Mattype>
void block_operation(double *const vals, const int numblocks)
{
Eigen::Map<Mattype> mappedvals(vals,
Mattype::IsRowMajor ? numblocks*bs : bs,
Mattype::IsRowMajor ? bs : numblocks*bs
);
for(int i = 0; i < numblocks; i++)
if(Mattype::IsRowMajor)
mappedvals.template block<bs,bs>(i*bs,0) = block_operation_rowmajor(mappedvals);
else
mappedvals.template block<bs,bs>(0,i*bs) = block_operation_colmajor(mappedvals);
}
The calling function first figures out the Mattype (out of 2 options) and then calls the above function with the correct template parameter.
Thus all my algorithms need to be written twice and my code is interspersed with these layout checks. Is there a way to do this in a layout-agnostic way? Keep in mind that this code needs to be as fast as possible.
Ideally, I would Map the data just once and use it for all the operations needed. However, the only solution I could come up with was invoking the Map constructor once for every small block, whenever I need to access the block.
template<int bs, StorageOptions layout>
inline Map<Matrix<double,bs,bs,layout>> extractBlock(double *const vals,
const int bindex)
{
return Map<Matrix<double,bs,bs,layout>>(vals+bindex*bs*bs);
}
Would this function be optimized away to nothing (by a modern compiler like GCC 7.3 or Intel 2017 under -std=c++14 -O3), or would I be paying a small penalty every time I invoke this function (once for each block, and there are a LOT of small blocks)? Is there a better way to do this?
Your extractBlock is fine; a simpler but somewhat uglier solution is to use a reinterpret_cast at the start of block_operation:
using BlockType = Matrix<double,bs,bs,layout|DontAlign>;
BlockType* blocks = reinterpret_cast<BlockType*>(vals);
for(int i...)
blocks[i] = ...;
This will work for fixed-size matrices only. Also note the DontAlign, which is important unless you can guarantee that vals is aligned on a 16- or even 32-byte boundary (depending on the presence of AVX and on bs)... so just use DontAlign!
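Spelled out, a sketch of block_operation using this cast (the per-block computation here is just a placeholder; block_operation_rowmajor/colmajor from the question would go in its place):
#include <Eigen/Dense>

template<int bs, Eigen::StorageOptions layout>
void block_operation(double *const vals, const int numblocks)
{
    using BlockType = Eigen::Matrix<double, bs, bs, layout | Eigen::DontAlign>;
    BlockType* blocks = reinterpret_cast<BlockType*>(vals);
    for (int i = 0; i < numblocks; ++i)
        blocks[i] = 2.0 * blocks[i];  // placeholder for the real per-block operation
}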

data locality for implementing 2d array in c/c++

A long time ago, inspired by "Numerical Recipes in C", I started to use the following construct for storing matrices (2D arrays).
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
for (i = 0; i < NumRows; ++i) x[i] = (double *)calloc(NumCol, sizeof(double));
return x;
}
double **x = allocate_matrix(1000,2000);
x[m][n] = ...;
But I recently noticed that many people implement matrices as follows:
double *x = (double *)malloc(NumRows * NumCols * sizeof(double));
x[NumCol * m + n] = ...;
From the locality point of view the second method seems perfect, but has awful readability... So I started to wonder: is my first method, with its auxiliary array of double* pointers, really that bad, or will the compiler eventually optimize it so that it is more or less equivalent in performance to the second method? I am suspicious because I think that in the first method two jumps are made when accessing a value, x[m] and then x[m][n], and there is a chance that each time the CPU will first load the x array and then the x[m] array.
P.S. Do not worry about the extra memory for storing the double* pointers; for large matrices it is just a small percentage.
P.P.S. Since many people did not understand my question very well, I will try to rephrase it: do I understand correctly that the first method is a kind of locality hell, where each time x[m][n] is accessed, the x array is first loaded into the CPU cache and then the x[m] array is loaded, making each access run at the speed of talking to RAM? Or am I wrong and the first method is also OK from a data-locality point of view?
For C-style allocations you can actually have the best of both worlds:
double **allocate_matrix(int NumRows, int NumCol)
{
double **x;
int i;
x = (double **)malloc(NumRows * sizeof(double *));
x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); // <<< single contiguous memory allocation for entire array
for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCol;
return x;
}
This way you get data locality and its associated cache/memory access benefits, and you can treat the array as a double** or as a flattened 2D array (array[i * NumCol + j]) interchangeably. You also have fewer calloc/free calls (2 versus NumRows + 1).
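For completeness, the matching deallocation only needs to undo those two allocations (a small sketch mirroring the allocator above):
void free_matrix(double **x)
{
    free(x[0]);  // the single contiguous block of values
    free(x);     // the array of row pointers
}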
No need to guess whether the compiler will optimize the first method. Just use the second method which you know is fast, and use a wrapper class that implements for example these methods:
double& operator()(int x, int y);
double const& operator()(int x, int y) const;
... and access your objects like this:
arr(2, 3) = 5;
Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.
Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.
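A minimal sketch of such a wrapper (the class name and row-major layout are my choices; bounds checking and the fancier arr[2][3] proxy are omitted):
#include <cstddef>
#include <vector>

class Matrix2D {
public:
    Matrix2D(int rows, int cols) : cols_(cols), data_(static_cast<std::size_t>(rows) * cols) {}

    double&       operator()(int m, int n)       { return data_[static_cast<std::size_t>(m) * cols_ + n]; }
    const double& operator()(int m, int n) const { return data_[static_cast<std::size_t>(m) * cols_ + n]; }

private:
    int cols_;
    std::vector<double> data_;  // one contiguous, row-major block
};

// usage:
//   Matrix2D arr(1000, 2000);
//   arr(2, 3) = 5;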
The two methods are quite different.
While the first method allows for easier direct access to the values by adding another indirection (the double** array, hence you need 1+N mallocs), ...
the second method guarantees that ALL values are stored contiguously and only requires one malloc.
I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.
In C++, you'd just implement it like this:
std::vector<double> matrix(NumRows * NumCols);
matrix[y * numCols + x] = value; // Access
and if you're concerned with the inconvenience of having to compute the index yourself, add a wrapper that implements operator()(int x, int y) to it.
You are also right that the first method is more expensive when accessing the values, because you need two memory lookups as you described: x[m] and then x[m][n]. There is no way the compiler will "optimize this away". The first array, depending on its size, may be cached, so the performance hit may not be that bad. In the second case, you need an extra multiplication for direct access.
In the first method you use, the double* entries in the master array point to logical rows (arrays of NumCol elements each).
So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):
foreach(row in rows):
foreach(elem in row):
//Do something
If you tried the same thing with the second method, with element access done the way you specified (i.e. x[NumCol*m + n]), you would still get the same benefit. This is because you would be treating the array as being in row-major order. If you tried the same pseudocode while accessing the elements in column-major order, I assume you'd get cache misses, given that the array is large enough.
In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).
So, in conclusion, the second method should be much better in terms of performance.
If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:
double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));
and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.
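For illustration, a usage sketch under the assumption that NumCol is a compile-time constant (the sizes are the ones from the question):
#include <cstdlib>

constexpr int NumCol = 2000;

void demo(int NumRows)
{
    double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof(double[NumCol]));
    x[3][7] = 42.0;  // the compiler does the m*NumCol + n arithmetic for you
    free(x);         // one allocation, one free
}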

Safe and fast FFT

Inspired by Herb Sutter's compelling lecture Not your father's C++, I decided to take another look at the latest version of C++ using Microsoft's Visual Studio 2010. I was particularly interested by Herb's assertion that C++ is "safe and fast" because I write a lot of performance-critical code.
As a benchmark, I decided to try to write the same simple FFT algorithm in a variety of languages.
I came up with the following C++11 code that uses the built-in complex type and vector collection:
#include <cmath>
#include <complex>
#include <vector>
using namespace std;
// Must provide type or MSVC++ barfs with "ambiguous call to overloaded function"
double pi = 4 * atan(1.0);
void fft(int sign, vector<complex<double>> &zs) {
unsigned int j=0;
// Warning about signed vs unsigned comparison
for(unsigned int i=0; i<zs.size()-1; ++i) {
if (i < j) {
auto t = zs.at(i);
zs.at(i) = zs.at(j);
zs.at(j) = t;
}
int m=zs.size()/2;
j^=m;
while ((j & m) == 0) { m/=2; j^=m; }
}
for(unsigned int j=1; j<zs.size(); j*=2)
for(unsigned int m=0; m<j; ++m) {
auto t = pi * sign * m / j;
auto w = complex<double>(cos(t), sin(t));
for(unsigned int i = m; i<zs.size(); i+=2*j) {
complex<double> zi = zs.at(i), t = w * zs.at(i + j);
zs.at(i) = zi + t;
zs.at(i + j) = zi - t;
}
}
}
Note that this function only works for n-element vectors where n is an integral power of two. Anyone looking for fast FFT code that works for any n should look at FFTW.
As I understand it, the traditional xs[i] syntax from C for indexing a vector does not do bounds checking and, consequently, is not memory safe and can be a source of memory errors such as non-deterministic corruption and memory access violations. So I used xs.at(i) instead.
Now, I want this code to be "safe and fast" but I am not a C++11 expert so I'd like to ask for improvements to this code that would make it more idiomatic or efficient?
I think you are being overly "safe" in your use of at(). In most of your cases the index used is trivially verifiable as being constrained by the size of the container in the for loop.
e.g.
for(unsigned int i=0; i<zs.size()-1; ++i) {
...
auto t = zs.at(i);
The only ones I'd leave as at()s are the (i + j)s. It's not immediately obvious whether they would always be constrained (although if I were really unsure I'd probably manually check - but I'm not familiar enough with FFTs to have an opinion in this case).
There are also some fixed computations being repeated for each loop iteration:
int m=zs.size()/2;
pi * sign
2*j
And the zs.at(i + j) is computed twice.
It's possible that the optimiser may catch these - but if you are treating this as performance critical, and you have your timers testing it, I'd hoist them out of the loops (or, in the case of zs.at(i + j), just take a reference on first use) and see if that impacts the timer.
Talking of second-guessing the optimiser: I'm sure that the calls to .size() will be inlined as, at the very least, a direct access to an internal member variable - but given how many times you call it I'd also experiment with introducing local variables for zs.size() and zs.size()-1 upfront. They're more likely to be put into registers that way too.
I don't know how much of a difference (if any) all of this will have on your total runtime - some of it may already be caught by the optimiser, and the differences may be small compared to the computations involved - but worth a shot.
As for being idiomatic, my only comment, really, is that size() returns a std::size_t (usually a typedef for an unsigned integer type - it's more idiomatic to use that type for the loop variable instead). If you did want to use auto but avoid the warning, you could try adding the ul suffix to the 0 - not sure I'd say that is idiomatic, though. I suppose you're already less than idiomatic in not using iterators here, but I can see why you can't do that (easily).
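To make those suggestions concrete, here is roughly what the second block of loops might look like with a local size, pi * sign hoisted, and [] used where the index is obviously in range (at() kept for the i + j accesses). This is just a sketch of the changes, not measured code:
const std::size_t n = zs.size();
for (std::size_t j = 1; j < n; j *= 2) {
    const double step = pi * sign / static_cast<double>(j);  // pi * sign * m / j with the invariant part hoisted
    for (std::size_t m = 0; m < j; ++m) {
        const auto w = complex<double>(cos(step * m), sin(step * m));
        for (std::size_t i = m; i < n; i += 2 * j) {
            const complex<double> zi = zs[i], t = w * zs.at(i + j);
            zs[i] = zi + t;
            zs.at(i + j) = zi - t;
        }
    }
}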
Update
I gave all my suggestions a try and they all had a measurable performance improvement - except the i+j and 2*j precalcs - they actually caused a slight slowdown! I presume they either prevented a compiler optimisation or prevented it from using registers for some things.
Overall I got a >10% perf. improvement with those suggestions.
I suspect more could be had if the second block of loops were refactored a little to avoid the jumps - and, having done so, enabling the SSE2 instruction set may give a significant boost (I did try it as is and saw a slight slowdown).
I think that refactoring, along with using something like MKL for the cos and sin calls should give greater, and less brittle, improvements. And neither of those things would be language dependent (I know this was originally being compared to an F# implementation).
Update 2
I forgot to mention that pre-calculating zs.size() did make a difference.
Update 3
Also forgot to say (until reminded by @xeo in a comment on the OP) that the block following the i < j check can be boiled down to a std::swap. This is more idiomatic and at least as performant - in the worst case it should inline to the same code as written. Indeed, when I did it I saw no change in performance. In other cases it can lead to a performance gain if move constructors are available.
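For reference, the swap mentioned above is a one-liner (std::swap lives in <utility>):
if (i < j)
    std::swap(zs.at(i), zs.at(j));  // replaces the manual three-statement swap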

How much can this c/c++ loop be optimized? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I am a newbie at optimization. I have been reading some references on how to optimize C++ code, but I have a hard time applying them to real code. Therefore I just want to gather some real-world optimization techniques for squeezing as much juice out of the CPU/memory as possible for the loop below:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
for(int t = 0; t < T; ++t){
sum += fun(a,b,c,d,e,f,sum);
*(array+t) = sum;
}
where a, b, c, d, e, f are double and T is int. Anything, including but not limited to memory alignment, parallelism, OpenMP/MPI, and SSE instructions, is welcome. The compiler is standard gcc, Microsoft, or another commonly available compiler. If the solution is compiler-specific, please specify the compiler and any option flags associated with your solution.
Thanks!
PS: I forgot to mention the properties of fun. Please assume that it is a simple function with no loop inside, composed of only basic arithmetic operations. Simply think of it as an inline function.
EDIT2: since the details of fun are important, please ignore the parameters c, d, e, f and assume fun is defined as
inline double fun(double a, double b, double sum){
return sum + a * (b - sum);
}
Since sum depends on its previous values in a non-trivial way, it is impossible to parallelize the code (so OpenMP and MPI are out).
Memory alignment and SSE should be enforced/used automatically with appropriate compiler settings.
Besides inlining fun and unrolling the loop (by compiling with -O3), there is not much we can do if fun is completely general.
Since fun(a,b,sum) = sum + a*(b-sum), we have the closed form
array[t] = a*b/(a-1) * (1 - (2-a)^(t+1))
which can be vectorized and parallelized, but the division and exponentiation can be very expensive.
Nevertheless, with the closed form we can start the loop from any index, e.g. create 2 threads, one running from t = 0 to T/2-1 and another from t = T/2 to T-1, which perform the original loop but with the initial sum computed using the above closed-form solution. Also, if only a few values from the array are needed, they can be computed lazily.
And for SSE, you could first fill the array with (2-a)^(t+1), then apply x -> x - 1 to the whole array, and then apply x -> x * c to the whole array, where c = a*b/(1-a); but there may be automatic vectorization already.
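A rough sketch of the two-thread idea described above, assuming fun is exactly the fun(a, b, sum) from EDIT2 and a != 1 (the chunking and the closed-form seeding are the point; everything else is a placeholder):
#include <cmath>
#include <thread>

// Fill array[t0..t1) with the running sums, seeding the recurrence from the
// closed form so the chunk does not depend on the chunk before it.
static void fill_chunk(double* array, int t0, int t1, double a, double b) {
    // sum(t) = a*b/(a-1) * (1 - (2-a)^(t+1)); the seed is sum(t0-1), which is 0 when t0 == 0
    double sum = a * b / (a - 1.0) * (1.0 - std::pow(2.0 - a, t0));
    for (int t = t0; t < t1; ++t) {
        sum += sum + a * (b - sum);  // the original sum += fun(a, b, sum)
        array[t] = sum;
    }
}

void fill_parallel(double* array, int T, double a, double b) {
    std::thread lower(fill_chunk, array, 0, T / 2, a, b);  // t = 0 .. T/2-1
    fill_chunk(array, T / 2, T, a, b);                     // t = T/2 .. T-1
    lower.join();
}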
Unless fun() is very trivial - in which case consider inlining it - it is likely to dominate anything else you can do to the loop.
You probably want to look at the algorithm inside fun()
one (very) minor optimization that can be done is:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
double* pStart = array;
double* pEnd = array + T;
while(pStart < pEnd)
{
sum += fun(a,b,c,d,e,f,sum);
*pStart++ = sum;
}
This eliminates the addition of t to array on every iteration of the loop; the incrementation of t is replaced by the incrementation of pStart. For small iteration counts (think fewer than 3; in that case the loop should be unrolled instead) there is no real gain. The compiler should do this automatically, but sometimes it needs a little encouragement.
Also, depending on the size range of T, it might be possible to gain performance by using a variable-length array (which would be stack allocated) or an aligned _alloca.
The code you have above is about as fast as you can make it.
Memory alignment is normally handled well enough by malloc anyway.
The code cannot be parallelized because f is a function of previous sums (so you can't break up the computation into chunks).
The computations are not specified so it's unclear whether SSE or CUDA or anything like that would be applicable.
Likewise, you can't perform any useful loop-unrolling based on the properties of f because we don't know what it does.
(Stylistically, I'd use array[t] since it's clearer what's going on and it is no slower.)
Edit: now that we have f(a,b,sum) = sum + a*(b-sum), we can try loop unrolling by hand to see if there's some pattern. Like so (where I'm using ** to mean "to the power of"):
sum(n) = sum(n-1) + sum(n-1) + a*(b-sum(n-1)) = (2-a)*sum(n-1) + a*b
sum(n) = (2-a)*( (2-a)*sum(n-2) + a*b ) + a*b
. . .
sum(n) = a*b*(2-a)**n + a*b*(2-a)**(n-1) + ... + a*b
sum(n) = a*b*( (2-a)**0 + (2-a)**1 + ... + (2-a)**n )
Well, now, isn't that interesting! We've converted from a recurrence to a geometric series! And you may recall that the geometric series
SUM( x**n , n = 0..N ) = (x**(N+1) - 1) / (x - 1)
so that
sum(n) = a*b*( (pow(2-a,n+1) - 1) / (1-a) )
Now that you've done that math, you can start in on the sum anywhere you want (using the somewhat expensive pow computation). If you have M free processors, and your array is long, you can split it into M equal pieces, use the above computation to find the first sum, and then use the recurrence formula you were using before (with the function) to fill the rest in.
At the very least, you could calculate a*b and 2-a separately and use those instead of the existing function:
sum = ab + twonega*sum
That cuts the math in your inner loop in half, approximately.
Accept @KennyTM's answer. He is wrong to state that the computation is not parallelisable, as he goes on to show. In showing that you can rewrite your recurrence relation in closed form he illustrates a very general principle of optimising programs -- choose the best algorithm you can find and implement that. None of the micro-optimisations that the other answers suggest will come close to computing the closed form and spreading the computation across many processors in parallel.
And, lest anyone offer the suggestion that this is just an example for learning parallelisation, I contend that @KennyTM's answer still holds good -- don't learn to optimise fragments of code, learn to optimise computations. Choose the best algorithm for your purposes, implement it well and only then worry about performance.
Have a look at callgrind, part of the valgrind toolset. Run your code through that and see if anything sticks out as taking an unusually large amount of time. Then you'll know what needs optimizing. Otherwise, you're just guessing, and you (as well as the rest of us) will likely guess wrong.
Just a couple of suggestions that haven't come up yet. I'm a bit out of date when it comes to modern PC-style processors, so they may make no significant difference.
Using float might be faster than double if you can tolerate the lower precision. Integer-based fixed point might be faster still, depending on how well floating point operations are pipelined.
Counting down from T to zero (and incrementing the array pointer each iteration) might be slightly faster - certainly, on an ARM processor this would save one cycle per loop.
Another very minor optimization would be to turn the for() into
while (--T)
as comparing with zero is usually faster than comparing two arbitrary integers.
I would enable vectorization in the compiler. You could rewrite the code to open up the loops, but the compiler will do that for you if it is a recent enough version.
You could use a pointer (array + t) as the loop variable instead; again, the optimizer might do this for you. That way your array index won't need any address arithmetic per access; again, the optimizer might already do this.
You could use the compiler switch that dumps the generated assembler code and, using that, see what you can change in the code to make it run faster.
Following the excellent answer by @KennyTM, I'd say the fastest way to do it sequentially should be:
double acc = 1, *array;
array = (double*) malloc(T * sizeof(double));
// the compiler should be able to derive those constant expressions, but let's do it explicitly anyway
const double k = a*b/(a-1);
const double twominusa = 2 - a;
for(int t = 0; t < T; ++t){
acc *= twominusa;
array[t] = k*(1-acc);
}
Rather than having the compiler unroll the loop, you could unroll the loop yourself and add some data prefetching. Search the web for "data driven design C++". Here is an example of loop unrolling and prefetching data:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
// Calculate the number iterations and the
// remaining iterations.
unsigned int iterations = T / 4;
unsigned int remaining_iterations = T % 4;
double sum1;
double sum2;
double sum3;
double sum4;
double * p_array = array;
for(unsigned int i = 0; i < iterations; ++i)
{
// Do some data precalculation
sum += fun(a,b,c,d,e,f,sum);
sum1 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum2 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum3 = sum;
sum += fun(a,b,c,d,e,f,sum);
sum4 = sum;
// Do a "block" transfer to the array.
p_array[0] = sum1;
p_array[1] = sum2;
p_array[2] = sum3;
p_array[3] = sum4;
p_array += 4;
}
// Handle the few remaining calculations
for (unsigned int t = 0; t < remaining_iterations; ++t)
{
sum += fun(a,b,c,d,e,f,sum);
p_array[t] = sum;
}
The big hit here is the call to the fun function. There are hidden setup and restore instructions involved when executing a function. Also, the call forces a branch, which will cause instruction pipelines to be flushed and reloaded (or cause the processor to waste time on branch prediction).
Another performance hit is the number of variables passed to the function. These variables have to be placed on the stack and copied into the function, which takes time.
Many computers have a dedicated multiply-accumulate unit implemented in the processor's hardware. Depending on your ultimate algorithm and target platform, you may be able to use this if the compiler isn't already using it when it optimizes.
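If you want to nudge the compiler toward such a unit, the simplified fun from EDIT2 can be phrased as a single fused multiply-add via std::fma; whether this actually helps depends on the target and the compiler:
#include <cmath>

// sum + a*(b - sum) written as one fused multiply-add
inline double fun(double a, double b, double sum) {
    return std::fma(a, b - sum, sum);  // a*(b - sum) + sum
}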
The following may not be worth it, but ....
The routine fun() takes seven (7) parameters.
Changing the order of the parameters to fun(sum, a, b, c, d, e, f) could help, IF the compiler can take advantage of the following scenario. Parameters a through f appear to be invariant, and only sum appears to be changing at this level in the code. As parameters are pushed onto the stack in C/C++ from right to left, if parameters a through f truly are invariant, then the compiler could in theory optimize the pushing of the stack variables. In other words, a through f would only need to be pushed onto the stack once, and sum could in theory be the only parameter pushed and popped while in the loop.
I do not know if the compiler would take advantage of such a scenario, but I am tossing it out there as a possibility. Disassembling could verify it as true or false, and profiling would indicate how much of a benefit that may be if true.