Efficient (non-standard) join of two Eigen::VectorXd - c++

I have two Eigen::VectorXd objects, A and B, with the same dimension n.
I want to create a new vector C such that:
If B[i] is NaN, C[i] = A[i]
Otherwise: C[i] = B[i]
As the application is latency-sensitive, I'd like to avoid making copies of A and B.
Right now I'm using a simple for-loop but I'd like advice on how to implement this in a smart(er) way with Eigen.

Try using select:
C = (B.array() == B.array()).select(B, A);
The B==B comparison will be true for the values that are not NaN and false otherwise, since NaN is the only value that compares unequal to itself.
For the true values, select returns the first matrix, for false the second.
As noted below by chtz, a more compact way of writing this would be:
C = B.array().isNaN().select(A, B);
In terms of performance, this is not vectorized (at least last time I checked), but does not introduce copies of A and B. It's probably the same as what you wrote (as far as I can tell without seeing code).
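For reference, a complete minimal program using this approach might look like the following (a sketch; the values are illustrative):
#include <Eigen/Dense>
#include <iostream>
#include <limits>

int main() {
    const double nan = std::numeric_limits<double>::quiet_NaN();

    Eigen::VectorXd A(4), B(4);
    A << 1.0, 2.0, 3.0, 4.0;
    B << 10.0, nan, 30.0, nan;

    // Where B is NaN take the entry from A, otherwise take it from B.
    Eigen::VectorXd C = B.array().isNaN().select(A, B);

    std::cout << C.transpose() << std::endl; // prints: 10 2 30 4
}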

Related

Which is the best way to index through a for loop?

I am trying to compute a product of the values inside a vector. It is a huge mess of code; I have posted it previously but no one was able to help. I just want to confirm the correct way to do a single part of it. I currently have:
vector<double> taylorNumerator;
for (a = 0; a <= constant; a++) {
    double Number = /* equation involving a to get numerous values */;
    taylorNumerator.push_back(Number);
}
double NewNumber = 1.0; // must be initialized before accumulating a product
for (b = 0; b <= constant; b++) {
    NewNumber *= taylorNumerator[b];
}
This is just a snapshot; it is much shorter than what I actually have. Someone told me it is better to use vector.at(index) instead. Which is the correct or best way to accomplish this? If you so desire I can paste all of the code; it runs, but the values I get are wrong.
When possible, you should probably avoid using indexes at all. Your options are:
A range-based for loop:
for (auto numerator : taylorNumerators) { ... }
An iterator-based loop:
for (auto it = taylorNumerators.begin(); it != taylorNumerators.end(); ++it) { ... }
A standard algorithm, perhaps with a lambda:
#include <algorithm>
std::for_each(taylorNumerators.begin(), taylorNumerators.end(), [](double numerator) { ... });
In particular, note that some algorithms let you specify a number of iterations, like std::generate_n, so you can create exactly n items without counting to n yourself.
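For instance, here is a sketch of std::generate_n filling a vector with n terms without a hand-written index (the update rule inside the lambda is a placeholder, not the asker's actual equation):
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

int main() {
    const std::size_t n = 10;
    std::vector<double> taylorNumerators;
    taylorNumerators.reserve(n);

    double term = 1.0;
    std::generate_n(std::back_inserter(taylorNumerators), n, [&term] {
        const double current = term;
        term *= 0.5; // placeholder update rule
        return current;
    });
}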
If you need the index in the calculation, then it can be appropriate to use a traditional for loop. You have to watch for a couple of pitfalls: std::vector<T>::size() returns a std::vector<T>::size_type, which is typically identical to std::size_t, which is (1) unsigned and (2) quite possibly larger than an int.
for (std::size_t i = 0; i != taylorNumerators.size(); ++i) { ... }
Your calculations probably deal with doubles or some numerical type other than std::size_t, so you have to consider the best way to convert it. Many programmers would rely on implicit conversions, but that can be dangerous unless you know the conversion rules very well. I'd generally start by doing a static cast of the index to the type I actually need. For example:
for (std::size_t i = 0; i != taylorNumerators.size(); ++i) {
    const auto x = static_cast<double>(i);
    /* calculation involving x */
}
In C++, it's probably far more common to make sure the index is in range and then use operator[] rather than to use at(). Many projects disable exceptions, so the safety guarantee of at() wouldn't really be available. And, if you can check the range once yourself, then it'll be faster to use operator[] than to rely on the range-check built into at() on each index operation.
What you have is fine. Modern compilers can optimize the heck out of the above such that the code is just as fast as the equivalent C code of accessing items directly.
The only optimization for using vector I recommend is to invoke taylorNumerator.reserve(constant) to allocate the needed storage upfront instead of the vector resizing itself as new items are added.
About the only worthy optimization after that is to not use vector at all and just use a static array - especially if constant is small enough that it doesn't blow up the stack (or binary size if global).
double taylorNumerator[constant];
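Putting these pieces together, a sketch of the product computation the question seems to be after could look like this (the term formula is again a placeholder):
#include <numeric>
#include <vector>

int main() {
    const int constant = 8; // placeholder bound from the question
    std::vector<double> taylorNumerator;
    taylorNumerator.reserve(constant + 1); // avoid reallocations while growing

    for (int a = 0; a <= constant; ++a) {
        taylorNumerator.push_back(a + 1.0); // placeholder for the real equation
    }

    // Product of all elements, starting from the multiplicative identity 1.
    const double product = std::accumulate(
        taylorNumerator.begin(), taylorNumerator.end(), 1.0,
        [](double acc, double v) { return acc * v; });
    (void)product;
}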

Fastest way to repeat a vector in Eigen c++

I have two vectors:
Eigen::Array2d A;
Eigen::Array4d B;
Basically, the vector A contains some values like
0.3
0.7
The idea is that I would like to get the vector B as follows
0.3
0.3
0.7
0.7
What is the fastest way to do that? I want the "fastest" way because I have to do this manipulation a lot of times. I know that I could use a mixture of replicate, transpose(), and Map functions to do it but it won't be so fast.
Should I use pointers, instead? Let's say the first two rows of B would point to the first row of A, and the two last rows of B would point to the last row of A? Does it make sense?
Perhaps a simple "for" loop?
Assuming your vectors contain doubles:
for (int i = 0; i < A.rows(); i++) {
    double cur = A(i);
    B(2*i) = cur;
    B(2*i + 1) = cur;
}
Eigen tends to optimize looping over vectors. The temporary is to avoid multiple access operations of A(i). You will also want to set row-major/column-major storage order correctly so memory access is as fast as possible.
I can't guarantee this is the fastest way, since I haven't benchmarked it, but my intuition tells me it would be faster than using several built-in Eigen functions together.
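For comparison, here is one way to express the same thing with replicate and a Map, by viewing B's storage as a 2x2 column-major block; this is a sketch I haven't benchmarked either:
#include <Eigen/Dense>

int main() {
    Eigen::Array2d A;
    A << 0.3, 0.7;

    Eigen::Array4d B;

    // View B's four doubles as a 2x2 column-major array: column 0 is
    // B(0),B(1) and column 1 is B(2),B(3). Each column should hold one
    // repeated entry of A, which is A.transpose() replicated along the rows.
    Eigen::Map<Eigen::Array22d>(B.data()) = A.transpose().replicate<2, 1>();
    // B now holds 0.3, 0.3, 0.7, 0.7
}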

Eigen: Efficiently storing the output of a matrix evaluation in a raw pointer

I am using some legacy C code that passes around lots of raw pointers. To interface with the code, I have to pass a function of the form:
const int N = ...;
T * func(T * x) {
    // TODO Put N elements in x
    return x + N;
}
where this function should write the result into x, and then return x + N.
Internally, in this function, I am using Eigen extensively to perform some calculations. Then I write the result back to the raw pointer using the Map class. A simple example which mimics what I am doing is this:
const int N = 5;
T * func(T * x) {
    // Do a lot of operations that result in some matrices like
    Eigen::Matrix<T, N, 1> A = ...
    Eigen::Matrix<T, N, 1> B = ...
    Eigen::Map<Eigen::Matrix<T, N, 1>> constraint(x);
    constraint = A - B;
    return x + N;
}
Obviously, there is much more complicated stuff going on internally, but that is the gist of it... Do some calculations with Eigen, then use the Map class to write the result back to the raw pointer.
Now the problem is that when I profile this code with Callgrind, and then view the results with KCachegrind, the line
constraint = A - B;
is almost always the bottleneck. This is sort of understandable, because such a line is potentially doing three things:
Constructing the Map object
Performing the calculation
Writing the result to the pointer
So it is understandable that this line might have the longest runtime. But I am a little bit worried that perhaps I am somehow doing an extra copy in that line before the data gets written to the raw pointer.
So is there a better way of writing the result to the raw pointer? Or is that the idiom I should be using?
In the back of my mind, I am wondering if using the placement new syntax would buy me anything here.
Note: This code is mission critical and should run in realtime, so I really need to squeeze every ounce of speed out of it. For instance, getting this call from a runtime of 0.12 seconds to 0.1 seconds would be huge for us. But code legibility is also a huge concern since we are constantly tweaking the model used in the internal calculations.
These two lines of code:
Eigen::Map<Eigen::Matrix<T, N, 1 >> constraint(x);
constraint = A - B;
are essentially compiled by Eigen as:
for (int i = 0; i < N; ++i)
    x[i] = A[i] - B[i];
The reality is a bit more complicated because of explicit unrolling and explicit vectorization (both depend on T), but that's essentially it. So the construction of the Map object is essentially a no-op (it is optimized away by any compiler) and no, there is no extra copy going on here.
Actually, if your profiler is able to tell you that the bottleneck lies on this simple expression, then that very likely means that this piece of code has not been inlined, meaning that you did not enable compiler optimization flags (like -O3 with gcc/clang).
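As a sanity check, a self-contained version of the pattern (with T fixed to double and the elided calculations replaced by placeholder constants) that you can inspect at -O3, e.g. with g++ -O3 -S, might look like:
#include <Eigen/Dense>

constexpr int N = 5;

double* func(double* x) {
    // Placeholder computations standing in for the real model.
    Eigen::Matrix<double, N, 1> A = Eigen::Matrix<double, N, 1>::Constant(1.0);
    Eigen::Matrix<double, N, 1> B = Eigen::Matrix<double, N, 1>::Constant(2.0);

    // Write the result straight into the caller's buffer; no temporary
    // matrix is materialized for A - B.
    Eigen::Map<Eigen::Matrix<double, N, 1>> constraint(x);
    constraint = A - B;
    return x + N;
}
With optimizations on, the assembly should show the subtraction compiled to plain loads, subtracts, and stores into x.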

Find the element in one array, which has the minimal absolute value

I am trying to find the one element in an array which has the minimum absolute value. For example, in the array [5.1, -2.2, 8.2, -1, 4, 3, -5, 6], I want to get the value -1. I use the following code (myarray is a 1D array and not sorted):
for (int i = 1; i < 8; ++i)
{
    if (fabsf(myarray[i]) < fabsf(myarray[0])) myarray[0] = myarray[i];
}
Then, the target value is in myarray[0].
Because I have to repeat this procedure many times, this piece of code becomes the bottleneck in my program. Does anyone know how to improve this code? Thanks in advance!
BTW, the size of the array is always eight. Could this be used to optimize this code?
Update: so far, the following code works slightly better on my machine:
float absMin = fabsf(myarray[0]); int index = 0;
for (int i = 1; i < 8; ++i)
{
    if (fabsf(myarray[i]) < absMin) { absMin = fabsf(myarray[i]); index = i; }
}
float result = myarray[index];
I am wondering how to avoid fabsf, because I just want to compare the absolute values instead of computing them. Does anyone have any idea?
There are some urban myths like inlining, loop unrolling by hand and similar which are supposed to make your code faster. The good news is you don't have to do any of that, at least if you use -O3 compiler optimization.
The bad news is, if you already use -O3, there is nothing you can do to speed up this function: the compiler will optimize the hell out of your code! For example, it will surely cache fabsf(myarray[0]) as some suggested. The only thing you can achieve with this "refactoring" is to build bugs into your program and make it less readable.
My advice is to look somewhere else for improvements:
try to reduce the number of invocations of this code
if this code is the bottleneck, then my guess would be that you recalculate the minimal value over and over again (otherwise, filling the values into the array would take approximately the same time) - so cache the results of the search
shift costs to changing the elements of the array, for example by using some fancy data structures (heaps, priority_queue) or by tracking the minimum of elements (see the sketch after this list). Let's say your array has only two elements, [1,2], so the minimum is 1. Now if you change
2 to 3, you don't have to do anything
2 to 0, you can easily update your minimum to 0
1 to 3, you have to loop through all elements. But maybe this case is not that often.
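Here is a hypothetical sketch of that last idea, with names of my own invention; it caches the index of the minimal-|x| element and rescans only when the cached minimum itself is overwritten:
#include <cmath>

struct MinTracker {
    float data[8];
    int minIdx = 0;

    void set(int i, float value) {
        data[i] = value;
        if (std::fabs(value) < std::fabs(data[minIdx])) {
            minIdx = i;   // cheap path: new value beats the cached minimum
        } else if (i == minIdx) {
            rescan();     // expensive path, hopefully rare
        }
    }

    void rescan() {
        minIdx = 0;
        for (int i = 1; i < 8; ++i)
            if (std::fabs(data[i]) < std::fabs(data[minIdx]))
                minIdx = i;
    }

    float min() const { return data[minIdx]; }
};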
Can you store the values with fabsf already applied?
Also, as @Gerstrong mentions, storing the minimum outside the loop and only recalculating it when the array changes will give you a boost.
Calling partial_sort or nth_element will only partially sort the array, just enough that the correct value ends up in the right location.
std::nth_element(v.begin(), v.begin(), v.end(), [](float lhs, float rhs) {
    return fabsf(lhs) < fabsf(rhs);
});
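Note that if you don't want to reorder the array at all, std::min_element with the same comparator does a single pass and leaves the data untouched:
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> v = {5.1f, -2.2f, 8.2f, -1.0f, 4.0f, 3.0f, -5.0f, 6.0f};

// One pass over the data; no elements are moved.
const float result = *std::min_element(v.begin(), v.end(),
    [](float lhs, float rhs) { return std::fabs(lhs) < std::fabs(rhs); });
// result == -1.0f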
Let me give some ideas that could help:
float minVal = fabsf(myarray[0]);
int minIndex = 0;
for (int i = 1; i < 8; ++i)
{
    if (fabsf(myarray[i]) < minVal) { minVal = fabsf(myarray[i]); minIndex = i; }
}
myarray[0] = myarray[minIndex]; // keep the signed element, not its absolute value
But compilers nowadays are very smart and you might not get any more speed, as you already get optimized code. It depends on how your mentioned piece of code is called.
Another way to optimize this may be to use C++ and the STL: you can do the following with std::set, which is typically implemented as a binary search tree:
// Absolute-value comparator for std::set
struct AbsLess
{
    bool operator()(float a, float b) const
    {
        return fabsf(a) < fabsf(b);
    }
};

std::set<float, AbsLess> mySet = {5.1f, -2.2f, 8.2f, -1.0f, 4.0f, 3.0f, -5.0f, 6.0f};
const float minVal = *mySet.begin();
With this approach, the numbers are kept sorted in ascending order of absolute value as you insert them. std::less is the default comparator for std::set, but you can replace it with something different, as in this example. This might help on larger datasets, but you mentioned you only have eight values to compare, so it really will not help.
Eight elements is a very small number, which might be kept on the stack, for example by declaring std::array<float,8> myarray close to your search function before filling it with data. You should test such variants on your full codebase and observe what helps. Of course, whether you declare std::array<float,8> myarray or float myarray[8] at runtime, you should get the same results.
You could also check whether fabsf really takes a float parameter and does not convert your variable to double, which would degrade the performance. There is also std::abs(), which, as I understand it, deduces the data type via overloads, since in C++ you can use templates etc.
If you don't want to use fabsf, you could obviously write a function like this
float myAbs(const float val)
{
    return (val < 0) ? -val : val;
}
or clear the sign bit, which is what makes your number negative. Either way, I'm pretty sure fabsf is fully aware of such tricks, and I don't think code like that will make it faster.
So I would check whether the argument is converted to double. If you have the C99 standard in your system, though, you should not have that issue.
One thought would be to do your comparisons "tournament" style, instead of linearly. In other words, you first compare 1 with 2, 3 with 4, etc. Then you take those 4 elements and do the same thing, and then again, until you only have one element left.
This does not change the number of comparisons. Since each comparison eliminates one element from the running, you will have exactly 7 comparisons no matter what. So why do I suggest this? Because it removes data dependencies from your code. Modern processors have multiple pipelines and can retire multiple instructions simultaneously. However, when you do the comparisons in a loop, each loop iteration depends on the previous one. When you do it tournament style, the first four comparisons are completely independent, so the processor may be able to do them all at once.
In addition to doing that, you can compute all the fabs at once in a trivial loop and put it in a new array. Since the fabs computations are independent, this can get sped up pretty easily. You would do this first, and then the tournament style comparisons to get the index. It should be exactly the same number of operations, it's just changing the order around so that the compiler can more easily see larger blocks that lack data dependencies.
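Here is an unbenchmarked sketch of that idea for the fixed size of eight (names are mine): first the independent fabs pass, then three rounds of comparisons, where all comparisons within a round are independent of each other:
#include <cmath>

float tournamentAbsMin(const float a[8])
{
    // Pass 1: eight independent absolute-value computations.
    float v[8];
    for (int i = 0; i < 8; ++i)
        v[i] = std::fabs(a[i]);

    // Round 1: four independent comparisons.
    int r1[4];
    for (int i = 0; i < 4; ++i)
        r1[i] = (v[2*i] < v[2*i + 1]) ? 2*i : 2*i + 1;

    // Round 2: two independent comparisons.
    int r2[2];
    for (int i = 0; i < 2; ++i)
        r2[i] = (v[r1[2*i]] < v[r1[2*i + 1]]) ? r1[2*i] : r1[2*i + 1];

    // Final round: one comparison decides the winner.
    const int winner = (v[r2[0]] < v[r2[1]]) ? r2[0] : r2[1];
    return a[winner]; // the original signed element
}
As promised, that is exactly 7 comparisons, just reordered so the processor can overlap them.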
The element of an array with minimal absolute value
Let the array A be
Eigen::VectorXd A(8);
A << 5.1, -2.2, 8.2, -1, 4, 3, -5, 6;
The minimal absolute value of A is
double miniAbsValue = A.array().abs().minCoeff();
int i_minimum = 0; // to find the position of the minimum absolute value
for (int i = 0; i < 8; i++)
{
    double ftn = A(i);
    if (fabs(ftn) == miniAbsValue)
    {
        i_minimum = i;
    }
}
Now the element of A with minimal absolute value is
A(i_minimum)

How much can this c/c++ loop be optimized? [closed]

I am a newbie on optimization. I have been reading some references on how to optimize C++ code but I have a hard time applying it to real code. Therefore I just want to gather some real-world optimization techniques on how to squeeze as much juice as possible from the CPU/memory in the loop below
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
for (int t = 0; t < T; ++t) {
    sum += fun(a,b,c,d,e,f,sum);
    *(array + t) = sum;
}
where a,b,c,d,e,f are doubles and T is an int. Anything including but not limited to memory alignment, parallelism, OpenMP/MPI, and SSE instructions is welcome. The compiler is standard gcc, Microsoft, or any commonly available compiler. If the solution is compiler specific, please specify the compiler and any option flags associated with your solution.
Thanks!
PS: Forgot to mention the properties of fun. Please assume that it is a simple function with no loop inside, composed of only basic arithmetic operations. Simply think of it as an inline function.
EDIT2: since the details of fun are important, please forget the parameters c, d, e, f and assume fun is defined as
inline double fun(double a, double b, double sum) {
    return sum + a * (b - sum);
}
Since sum depends on its previous values in a non-trivial way, it is impossible to parallelize the code (so OpenMP and MPI are out).
Memory alignment and SSE should be enforced/used automatically with appropriate compiler settings.
Besides inlining fun and unrolling the loop (by compiling in -O3) there is not much we can do if fun is completely general.
Since fun(a,b,sum) = sum + a*(b-sum), we have the closed form
array[t] = a*b/(a-1) * (1 - (2-a)^(t+1))
which can be vectorized and parallelized, but the division and exponentiation can be very expensive.
Nevertheless, with the closed form we can start the loop from any index, e.g. create 2 threads, one from t = 0 to T/2-1, another from t = T/2 to T-1, each performing the original loop but with its initial sum computed using the closed-form solution above. Also, if only a few values from the array are needed, they can be computed lazily.
And for SSE, you could first fill the array with (2-a)^(t+1), then apply x -> x - 1 to the whole array, and then apply x -> x * c to the whole array, where c = a*b/(1-a); but there may be automatic vectorization already.
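A sketch of that three-pass idea (unverified, and assuming a != 1 so the constants are finite):
void fill_closed_form(double* array, int T, double a, double b)
{
    const double x = 2.0 - a;
    const double c = a * b / (1.0 - a); // assumes a != 1

    // Pass 1: fill with the powers x^(t+1) via a running product.
    double p = 1.0;
    for (int t = 0; t < T; ++t) {
        p *= x;
        array[t] = p;
    }
    // Pass 2: x -> x - 1 across the whole array (trivially vectorizable).
    for (int t = 0; t < T; ++t) array[t] -= 1.0;
    // Pass 3: x -> x * c across the whole array (trivially vectorizable).
    for (int t = 0; t < T; ++t) array[t] *= c;
}
Note that pass 1 still carries a dependency through p; passes 2 and 3 have no such dependency.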
Unless fun() is very trivial (in which case consider inlining it), it is likely to dominate anything else you can do to the loop.
You probably want to look at the algorithm inside fun().
One (very) minor optimization that can be done is:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
double* pStart = array;
double* pEnd = array + T;
while (pStart < pEnd)
{
    sum += fun(a,b,c,d,e,f,sum);
    *pStart++ = sum;
}
This eliminates adding t to array on every iteration of the loop; the incrementation of t is replaced by the incrementation of pStart. For small iteration counts (think fewer than 3, in which case the loop should be unrolled) there is no real gain. The compiler should do this automatically, but sometimes it needs a little encouragement.
Also, depending on the size range of T, it might be possible to gain performance by using a variable-length array (which would be stack allocated) or an aligned _alloca.
The code you have above is about as fast as you can make it.
Memory alignment is normally handled well enough by malloc anyway.
The code cannot be parallelized because f is a function of previous sums (so you can't break up the computation into chunks).
The computations are not specified so it's unclear whether SSE or CUDA or anything like that would be applicable.
Likewise, you can't perform any useful loop-unrolling based on the properties of f because we don't know what it does.
(Stylistically, I'd use array[t] since it's clearer what's going on and it is no slower.)
Edit: now that we have f(a,b,sum) = sum + a*(b-sum), we can try loop unrolling by hand to see if there's some pattern. Like so (where I'm using ** to mean "to the power of"):
sum(n) = sum(n-1) + sum(n-1) + a*(b-sum(n-1)) = (2-a)*sum(n-1) + a*b
sum(n) = (2-a)*( (2-a)*sum(n-2) + a*b ) + a*b
. . .
sum(n) = a*b*(2-a)**n + a*b*(2-a)**(n-1) + ... + a*b
sum(n) = a*b*( (2-a)**0 + (2-a)**1 + ... + (2-a)**n )
Well, now, isn't that interesting! We've converted from a recurrent formula to a geometric series! And, you may recall that the geometric series
SUM( x**n , n = 0..N ) = (x**(N+1) - 1) / (x - 1)
so that
sum(n) = a*b*( (pow(2-a,n+1) - 1) / (1-a) )
Now that you've done that math, you can start in on the sum anywhere you want (using the somewhat expensive pow computation). If you have M free processors, and your array is long, you can split it into M equal pieces, use the above computation to find the first sum, and then use the recurrence formula you were using before (with the function) to fill the rest in.
At the very least, you could calculate a*b and 2-a separately and use those instead of the existing function:
sum = ab + twominusa*sum
That cuts the math in your inner loop in half, approximately.
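A sketch of the two-thread split described in these answers, with helper names of my own (the closed form seeds the second half's starting sum, assuming a != 1):
#include <cmath>
#include <thread>

// Closed form: the running sum after filling array[0..t] is
// a*b*((2-a)^(t+1) - 1)/(1-a).
static double closedFormSum(double a, double b, int t)
{
    return a * b * (std::pow(2.0 - a, t + 1) - 1.0) / (1.0 - a);
}

static void fillRange(double* array, int begin, int end,
                      double a, double b, double startSum)
{
    double sum = startSum;
    for (int t = begin; t < end; ++t) {
        sum = (2.0 - a) * sum + a * b; // same recurrence as the original loop
        array[t] = sum;
    }
}

void fillParallel(double* array, int T, double a, double b)
{
    const int mid = T / 2;
    // One pow call seeds the second half's running sum.
    const double midSum = closedFormSum(a, b, mid - 1);

    std::thread first(fillRange, array, 0, mid, a, b, 0.0);
    std::thread second(fillRange, array, mid, T, a, b, midSum);
    first.join();
    second.join();
}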
Accept @KennyTM's answer. His opening claim that the computation is not parallelisable is wrong, as he himself goes on to show. In showing that you can rewrite your recurrence relation in closed form he illustrates a very general principle of optimising programs -- choose the best algorithm you can find and implement that. None of the micro-optimisations that the other answers suggest will come close to computing the closed form and spreading the computation across many processors in parallel.
And, lest anyone offer the suggestion that this is just an example for learning parallelisation, I contend that @KennyTM's answer still holds good -- don't learn to optimise fragments of code, learn to optimise computations. Choose the best algorithm for your purposes, implement it well and only then worry about performance.
Have a look at callgrind, part of the valgrind toolset. Run your code through that and see if anything sticks out as taking an unusually large amount of time. Then you'll know what needs optimizing. Otherwise, you're just guessing, and you (as well as the rest of us) will likely guess wrong.
Just a couple of suggestions that haven't come up yet. I'm a bit out of date when it comes to modern PC-style processors, so they may make no significant difference.
Using float might be faster than double if you can tolerate the lower precision. Integer-based fixed point might be faster still, depending on how well floating point operations are pipelined.
Counting down from T to zero (and incrementing array each iteration) might be slightly faster - certainly, on an ARM processor this would save one cycle per loop.
Another very minor optimization would be to turn the for() into
while (T--)
since comparing with zero is usually faster than comparing two arbitrary integers. (Note that --T would skip one iteration; T-- executes the body exactly T times.)
I would enable vector processing on the compiler. You could rewrite the code to open up the loops, but the compiler will do it for you if it is a recent version.
You could use array + t as a moving pointer instead of the for-loop index, which means your array access won't use a multiply; again, the optimizer might do this for you.
You could use the compiler switch that dumps the generated assembly code (e.g. -S on gcc) and, looking at that, see what you can change in the code to make it run faster.
Following the excellent answer by @KennyTM, I'd say the fastest way to do it sequentially should be:
double acc = 1, *array;
array = (double*) malloc(T * sizeof(double));
// the compiler should be able to derive those constant expressions, but let's do it explicitly anyway
const double k = a*b/(a-1);
const double twominusa = 2 - a;
for (int t = 0; t < T; ++t) {
    acc *= twominusa;
    array[t] = k*(1-acc);
}
Rather than having the compiler unroll the loop, you could unroll the loop yourself and add some data prefetching. Search the web for data-driven design in C++. Here is an example of loop unrolling and prefetching data:
double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
// Calculate the number of unrolled iterations and the
// remaining iterations.
unsigned int iterations = T / 4;
unsigned int remaining_iterations = T % 4;
double sum1;
double sum2;
double sum3;
double sum4;
double * p_array = array;
for (unsigned int i = 0; i < iterations; ++i)
{
    // Do some data precalculation
    sum += fun(a,b,c,d,e,f,sum);
    sum1 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum2 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum3 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum4 = sum;
    // Do a "block" transfer to the array.
    p_array[0] = sum1;
    p_array[1] = sum2;
    p_array[2] = sum3;
    p_array[3] = sum4;
    p_array += 4;
}
// Handle the few remaining calculations
for (unsigned int t = 0; t < remaining_iterations; ++t)
{
    sum += fun(a,b,c,d,e,f,sum);
    p_array[t] = sum;
}
The big hit here is the call to the fun function. There are hidden setup and restore instructions involved when executing a function. Also, the call forces a branch which will cause instruction pipelines to be flushed and reloaded (or cause the processor to waste time on branch prediction).
Another time performance hit is the number of variables passed to the function. These variables have to be placed on the stack and copied into the function, which takes time.
Many computers have a dedicated multiply-accumulate unit implemented in the processor's hardware. Depending on your ultimate algorithm and target platform, you may be able to use this if the compiler isn't already using it when it optimizes.
The following may not be worth it, but ....
The routine fun() takes seven (7) parameters.
Changing the order of the parameters to fun(sum, a, b, c, d, e, f) could help, IF the compiler can take advantage of the following scenario. Parameters a through f appear to be invariant, and only sum appears to be changing at this level in the code. As parameters are pushed onto the stack in C/C++ from right to left, if parameters a through f truly are invariant, then the compiler could in theory optimize the pushing of the stack variables: a through f would only need to be pushed onto the stack once, and sum could in theory be the only parameter pushed and popped while in the loop.
I do not know if the compiler would take advantage of such a scenario, but I am tossing it out there as a possibility. Disassembling could verify it as true or false, and profiling would indicate how much of a benefit that may be if true.