I have two vectors:
Eigen::Array2d A;
Eigen::Array4d B;
Basically, the vector A contains some value like
0.3
0.7
The idea is that I would like to get the vector B as follows
0.3
0.3
0.7
0.7
What is the fastest way to do that? I want the "fastest" way because I have to do this manipulation a lot of times. I know that I could use a mixture of replicate, transpose(), and Map functions to do it but it won't be so fast.
Should I use pointers, instead? Let's say the first two rows of B would point to the first row of A, and the two last rows of B would point to the last row of A? Does it make sense?
Perhaps a simple "for" loop?
Assuming your vectors contain doubles:
for(int i = 0; i < A.rows(); i++){
double cur = A(i);
B(2*i) = cur;
B(2*i + 1) = cur;
}
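Eigen tends to optimize looping over vectors. The temporary avoids accessing A(i) more than once. Storage order (row-major versus column-major) makes no difference for one-dimensional arrays like these; it only matters if A and B are actually rows or columns of a larger matrix, in which case you will want to set it so that memory access is contiguous.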
I can't guarantee this is the fastest way, since I haven't benchmarked it, but my intuition tells me it would be faster than using several built-in Eigen functions together.
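The loop above can be wrapped in a small helper. This is only a sketch: duplicate_each is a hypothetical name, and it assumes nothing more than size() and operator[] on both arguments, which std::array and fixed-size Eigen vectors/arrays provide. It has not been benchmarked.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Writes every element of `in` twice into `out`:
// in = [a, b]  ->  out = [a, a, b, b].
// Hypothetical helper; `In` and `Out` only need size() and operator[].
template <typename In, typename Out>
void duplicate_each(const In& in, Out& out)
{
    const std::size_t n = static_cast<std::size_t>(in.size());
    assert(static_cast<std::size_t>(out.size()) == 2 * n);
    for (std::size_t i = 0; i < n; ++i) {
        const auto cur = in[i];   // read the source element once
        out[2 * i] = cur;
        out[2 * i + 1] = cur;
    }
}
```

With the vectors from the question, the call would be duplicate_each(A, B).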
I have two Eigen::VectorXd objects, A and B, with the same dimension n.
I want to create a new vector C such that:
If B[i] is NaN, C[i] = A[i]
Otherwise: C[i] = B[i]
As the application is latency-sensitive, I'd like to avoid making copies of A and B.
Right now I'm using a simple for-loop but I'd like advice on how to implement this in a smart(er) way with Eigen.
Try using select:
C = (B.array() == B.array()).select(B, A);
The comparison B==B is true for the values that are not NaN and false otherwise.
For the true values, select returns the first matrix, for false the second.
As noted below by chtz, a more compact way of writing this would be:
C = B.array().isNaN().select(A, B);
In terms of performance, this is not vectorized (at least last time I checked), but does not introduce copies of A and B. It's probably the same as what you wrote (as far as I can tell without seeing code).
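For comparison, here is the same element-wise rule written as a plain loop over std::vector. It is a sketch of the baseline the select expression is being compared against, not the asker's actual code; replace_nan is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// For each i: c[i] = a[i] if b[i] is NaN, otherwise c[i] = b[i].
// Hypothetical helper; reads a and b without copying them.
std::vector<double> replace_nan(const std::vector<double>& a,
                                const std::vector<double>& b)
{
    assert(a.size() == b.size());
    std::vector<double> c(b.size());
    for (std::size_t i = 0; i < b.size(); ++i) {
        c[i] = std::isnan(b[i]) ? a[i] : b[i];
    }
    return c;
}
```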
I have n number of vectors (single column), which are correlated. One of them (let's say the first of the n vectors) has a bunch of NaNs in it. I have used the erase - remove_if idiom to clear this vector of the rows which contain the NaNs. I want to remove the exact same row from all the other vectors either simultaneously or after the fact. This seems like it would be a common coding problem, but I can't find an example. I'm coding in C++, with OpenCV libraries.
Here is my code sample that doesn't work, which I think is kind of what Miki is suggesting
vector<float> RemoveManyEs(vector<float> &V1, vector<float> &V2, vector<float> &V3)
{
int length = V1.size();
int n = 0;
do
{
if (isnan(V1.at(n)))
{
V1.erase(V1.begin() + n);
V2.erase(V2.begin() + n);
V3.erase(V3.begin() + n);
}
n += 1;
} while (n < length);
return V1,V2,V3;
}
Consider whether or not you really need to delete the rows or whether it would suffice to set them to zero. If less than 10% of the rows need to be deleted, then removing them would have a negligible effect on run-time (and could even make things take longer overall after the overhead of reallocation).
If you are computing correlation matrices for example, then you'll get the same result by zeroing out all elements. This is generally a much simpler operation and will give you the result you need for most applications.
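If the rows really do have to go, a single compaction pass avoids the index-shifting problem of calling erase inside a loop. This is a sketch of that alternative, not the zeroing approach recommended above; remove_nan_rows is a hypothetical name, and it assumes all three vectors have the same length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Removes every row i for which v1[i] is NaN from all three vectors.
// Surviving rows keep their relative order; each element is moved at most once.
void remove_nan_rows(std::vector<float>& v1,
                     std::vector<float>& v2,
                     std::vector<float>& v3)
{
    assert(v1.size() == v2.size() && v1.size() == v3.size());
    std::size_t keep = 0;   // next free slot for a surviving row
    for (std::size_t i = 0; i < v1.size(); ++i) {
        if (!std::isnan(v1[i])) {
            v1[keep] = v1[i];
            v2[keep] = v2[i];
            v3[keep] = v3[i];
            ++keep;
        }
    }
    v1.resize(keep);
    v2.resize(keep);
    v3.resize(keep);
}
```

The vectors are modified in place through the references, so nothing needs to be returned.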
I am trying to find the element of an array that has the minimum absolute value. For example, in the array [5.1, -2.2, 8.2, -1, 4, 3, -5, 6], I want to get the value -1. I use the following code (myarray is a 1D array and not sorted):
for (int i = 1; i < 8; ++i)
{
if(fabsf(myarray[i])<fabsf(myarray[0])) myarray[0] = myarray[i];
}
Then, the target value is in myarray[0].
Because I have to repeat this procedure many times, this piece of code becomes the bottleneck in my program. Does anyone know how to improve this code? Thanks in advance!
BTW, the size of the array is always eight. Could this be used to optimize this code?
Update: so far, following code works slightly better on my machine:
float absMin = fabsf(myarray[0]); int index = 0;
for (int i = 1; i < 8; ++i)
{
if(fabsf(myarray[i])<absMin) {absMin = fabsf(myarray[i]); index=i;}
}
float result = myarray[index];
I am wondering how to avoid fabsf, because I just want to compare the absolute values instead of computing them. Does anyone have any idea?
There are some urban myths like inlining, loop unrolling by hand and similar which are supposed to make your code faster. Good news is you don't have to do it, at least if you use -O3 compiler optimization.
Bad news is, if you already use -O3 there is nothing you can do to speed up this function: the compiler will optimize the hell out of your code! For example it will surely do the caching of fabsf(myarray[0]) as some suggested. The only thing you can achieve with this "refactoring" is to build bugs into your program and make it less readable.
My advice is to look somewhere else for improvements:
try to reduce the number of invocations of this code
if this code is the bottleneck, then my guess would be that you recalculate the minimal value over and over again (otherwise filling the values into the array would take approximately the same time), so cache the results of the search
shift costs to changing the elements of the array, for example by using some fancy data structures (heaps, priority_queue) or by tracking the minimum of the elements. Let's say your array has only two elements with values [1,2], so the minimum is 1. Now if you change
2 to 3, you don't have to do anything
2 to 0, you can easily update your minimum to 0
1 to 3, you have to loop through all elements. But maybe this case is not that often.
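The three cases above can be sketched as a small class that tracks the minimum as elements change. AbsMinTracker and its members are hypothetical names, and the fixed size of eight is taken from the question; whether this beats a plain scan depends on how often elements change compared with how often the minimum is read.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: keeps track of the element with the smallest
// absolute value while individual elements are overwritten.
class AbsMinTracker
{
public:
    explicit AbsMinTracker(const std::array<float, 8>& values)
        : values_(values)
    {
        rescan();
    }

    void set(std::size_t i, float v)
    {
        assert(i < values_.size());
        const float old_min_abs = std::fabs(values_[min_index_]);
        values_[i] = v;
        if (std::fabs(v) < old_min_abs) {
            min_index_ = i;   // new value beats the old minimum: cheap update
        } else if (i == min_index_) {
            rescan();         // the minimum itself got larger: full scan
        }
        // otherwise a non-minimal element changed and nothing needs doing
    }

    float min_abs_element() const { return values_[min_index_]; }

private:
    void rescan()
    {
        min_index_ = 0;
        for (std::size_t i = 1; i < values_.size(); ++i) {
            if (std::fabs(values_[i]) < std::fabs(values_[min_index_])) {
                min_index_ = i;
            }
        }
    }

    std::array<float, 8> values_;
    std::size_t min_index_ = 0;
};
```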
Can you store the values with fabs already applied?
Also, as @Gerstrong mentions, storing the number outside the loop and only recalculating it when the array changes will give you a boost.
Calling partial_sort or nth_element reorders the array only as far as needed to put the correct value in the right location.
std::nth_element(v.begin(), v.begin(), v.end(), [](float lhs, float rhs){
return fabsf(lhs)<fabsf(rhs);
});
Let me give some ideas that could help:
float minVal = fabsf(myarray[0]);
for (int i = 1; i < 8; ++i)
{
if(fabsf(myarray[i])<minVal) minVal = fabsf(myarray[i]);
}
myarray[0] = minVal;
But compilers nowadays are very smart and you might not get any more speed, as you already get optimized code. It depends on how your mentioned piece of code is called.
Another way to optimize this might be to use C++ and the STL, so you can do the following with std::set, which is typically a binary search tree:
// Absolute-value comparator for std::set (must be a type, not a function)
struct AbsLess
{
bool operator()(float a, float b) const { return fabsf(a) < fabsf(b); }
};
// Values of equal magnitude (5 and -5) count as duplicates, so only one is kept
std::set<float, AbsLess> mySet = {5.1f, -2.2f, 8.2f, -1.0f, 4.0f, 3.0f, -5.0f, 6.0f};
const float minVal = *(mySet.begin());
With this approach, your numbers are kept sorted in ascending order of absolute value as you insert them. The less comparator is the default for std::set, but you can change it to something different, as in this example. This might help on larger datasets, but you mentioned you only have eight values to compare, so it really will not help.
Eight elements is a very small number, which might be kept on the stack, for example by declaring std::array<float,8> myarray close to your search function before filling it with data. You should test those variants on your full codebase and observe what helps. Of course, whether you declare std::array<float,8> myarray or float myarray[8], you should get the same results at runtime.
What you could also check is whether fabsf really takes a float as its parameter and does not convert your variable to double, which would degrade the performance. There is also std::abs(), which as far as I understand picks the right version for the argument type, because C++ provides overloads for float, double and long double.
If you don't want to use fabs, you could obviously write a function like this
float myAbs(const float val)
{
return (val<0) ? -val : val;
}
or you hack the sign bit (the one that makes your number negative) to zero. Either way, I'm pretty sure that fabsf is fully aware of that, and I don't think code like that will make it faster.
So I would check if the argument is converted to double. If you have C99 Standard in your system though, you should not have that issue.
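The sign-bit idea can be sketched as follows: clear the sign bit and compare the remaining bit patterns as unsigned integers. This assumes IEEE-754 32-bit floats and no NaN inputs, both of which are assumptions rather than guarantees of the language, and abs_bits and min_abs_index are hypothetical names. As said above, it is not expected to beat fabsf.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Bit pattern of x with the sign bit cleared.
// Assumption: IEEE-754 binary32, where for non-NaN values the order of
// these integers matches the order of the magnitudes.
inline std::uint32_t abs_bits(float x)
{
    assert(!std::isnan(x));
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // well-defined way to read the bits
    return bits & 0x7FFFFFFFu;
}

// Index of the element with the smallest absolute value among 8 floats.
inline int min_abs_index(const float (&a)[8])
{
    int index = 0;
    std::uint32_t best = abs_bits(a[0]);
    for (int i = 1; i < 8; ++i) {
        const std::uint32_t cur = abs_bits(a[i]);
        if (cur < best) {
            best = cur;
            index = i;
        }
    }
    return index;
}
```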
One thought would be to do your comparisons "tournament" style, instead of linearly. In other words, you first compare 1 with 2, 3 with 4, etc. Then you take those 4 elements and do the same thing, and then again, until you only have one element left.
This does not change the number of comparisons. Since each comparison eliminates one element from the running, you will have exactly 7 comparisons no matter what. So why do I suggest this? Because it removes data dependencies from your code. Modern processors have multiple pipelines and can retire multiple instructions simultaneously. However, when you do the comparisons in a loop, each loop iteration depends on the previous one. When you do it tournament style, the first four comparisons are completely independent, so the processor may be able to do them all at once.
In addition to doing that, you can compute all the fabs at once in a trivial loop and put it in a new array. Since the fabs computations are independent, this can get sped up pretty easily. You would do this first, and then the tournament style comparisons to get the index. It should be exactly the same number of operations, it's just changing the order around so that the compiler can more easily see larger blocks that lack data dependencies.
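A sketch of the tournament for exactly eight elements might look like this. The function names are hypothetical, and whether the compiler or the processor actually exploits the independence between the first-round comparisons would have to be measured.

```cpp
#include <cmath>

// Returns whichever of the two indices refers to the element of smaller
// magnitude (the first index wins ties).
inline int smaller_abs(const float* a, int i, int j)
{
    return std::fabs(a[j]) < std::fabs(a[i]) ? j : i;
}

// Tournament-style search over exactly 8 elements: 4 + 2 + 1 = 7 comparisons.
inline int min_abs_index_tournament(const float* a)
{
    // Round 1: four comparisons with no dependencies between them.
    const int w01 = smaller_abs(a, 0, 1);
    const int w23 = smaller_abs(a, 2, 3);
    const int w45 = smaller_abs(a, 4, 5);
    const int w67 = smaller_abs(a, 6, 7);
    // Round 2: two independent comparisons between the round-1 winners.
    const int w0123 = smaller_abs(a, w01, w23);
    const int w4567 = smaller_abs(a, w45, w67);
    // Final.
    return smaller_abs(a, w0123, w4567);
}
```

The result is an index, so the original signed value is available as myarray[index].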
The element of an array with minimal absolute value
Let the array A be
A = [5.1, -2.2, 8.2, -1, 4, 3, -5, 6]
The minimal absolute value of A is,
double miniAbsValue = A.array().abs().minCoeff();
int i_minimum = 0; // to find the position of minimum absolute value
for(int i = 0; i < 8; i++)
{
double ftn = A(i);
if( fabs(ftn) == miniAbsValue )
{
i_minimum = i;
}
}
Now the element of A with minimal absolute value is
A(i_minimum)
I am given
struct point
{
int x;
int y;
};
and the table of points:
point tab[MAX];
Program should return the minimal distance between the centers of gravity of any possible pair of subsets from tab. Subset can be any size (of course >=1 and < MAX).
I am obliged to write this program using recursion.
So my function will be int type because I have to return int.
I set a global variable min (because during the recursion I have to compare some values with this min)
int min = 0;
My function should for sure, take number of elements I add, sum of Y coordinates and sum of X coordinates.
int return_min_distance(int sY, int sX, int number, bool iftaken[])
I will be glad for any help further.
I thought about another table of bools which I pass as a parameter to determine if I took value or not from table. Still my problem is how to implement this, I do not know how to even start.
I think you need a function that can iterate through all subsets of the table, starting with either nothing or an existing iterator. The code then gets easy:
int min_distance = MAXINT;
SubsetIterator si1(0, tab);
while (si1.hasNext())
{
SubsetIterator si2(&si1, tab);
while (si2.hasNext())
{
int d = subsetDistance(tab, si1.subset(), si2.subset());
if (d < min_distance)
{
min_distance = d;
}
}
}
The SubsetIterators can be simple base-2 numbers with MAX bits, where a 1 bit indicates membership in the subset. Yes, it's an O(N^2) algorithm in the number of subsets N (which is itself exponential in MAX), but I think it has to be.
The trick is incorporating recursion. Sorry, I just don't see how it helps here. If I can think of a way to use it, I'll edit my answer.
Update: I thought about this some more, and while I still can't see a use for recursion, I found a way to make the subset processing easier. Rather than run through the entire table for every distance computation, the SubsetIterators could store precomputed sums of the x and y values for easy distance computation. Then, on every iteration, you subtract the values that are leaving the subset and add the values that are joining. A simple bitwise operation on the old and new bitmaps can reveal these. To be even more efficient, you could use Gray code instead of plain binary counting to order the membership bitmaps. This would guarantee that at each iteration exactly one value enters or leaves the subset. Minimal work.
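The Gray-code idea can be sketched like this. Point mirrors the point struct from the question; SubsetSum and enumerate_subsets are hypothetical names. The sketch visits every non-empty subset, including the full set, so filtering by count is left to the caller, and storing all subsets in a vector is only practical for small tables.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Point { int x; int y; };   // mirrors `point` from the question

// One visited subset: membership bitmap plus running coordinate sums.
struct SubsetSum { std::uint32_t mask; long sum_x; long sum_y; int count; };

// Visits all non-empty subsets of pts in Gray-code order, so consecutive
// subsets differ by exactly one point and the sums are updated in O(1).
std::vector<SubsetSum> enumerate_subsets(const std::vector<Point>& pts)
{
    const unsigned n = static_cast<unsigned>(pts.size());
    assert(n >= 1 && n < 32);
    std::vector<SubsetSum> out;
    std::uint32_t prev = 0;
    long sx = 0, sy = 0;
    int count = 0;
    for (std::uint32_t k = 1; k < (std::uint32_t{1} << n); ++k) {
        const std::uint32_t gray = k ^ (k >> 1);     // k-th Gray code
        const std::uint32_t changed = gray ^ prev;   // exactly one bit set
        unsigned bit = 0;
        while (((changed >> bit) & 1u) == 0) ++bit;  // which point flipped
        if (gray & changed) {                        // point joins the subset
            sx += pts[bit].x; sy += pts[bit].y; ++count;
        } else {                                     // point leaves the subset
            sx -= pts[bit].x; sy -= pts[bit].y; --count;
        }
        out.push_back(SubsetSum{gray, sx, sy, count});
        prev = gray;
    }
    return out;
}
```

The center of gravity of a subset is then (sum_x / count, sum_y / count).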
I have a class containing a number of double values. This is stored in a vector where the indices for the classes are important (they are referenced from elsewhere). The class looks something like this:
Vector of classes
class A
{
public:
double count;
double val;
double sumA;
double sumB;
vector<double> sumVectorC;
vector<double> sumVectorD;
};
vector<A> classes(10000);
The code that needs to run as fast as possible is something like this:
vector<double> result(classes.size());
for(int i = 0; i < classes.size(); i++)
{
result[i] += classes[i].sumA;
vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
if(it != classes[i].sumVectorC.end())
result[i] += *it;
}
The alternative is instead of one giant loop, split the computation into two separate loops such as:
for(int i = 0; i < classes.size(); i++)
{
result[i] += classes[i].sumA;
}
for(int i = 0; i < classes.size(); i++)
{
vector<double>::iterator it = find(classes[i].sumVectorC.begin(), classes[i].sumVectorC.end(), testval);
if(it != classes[i].sumVectorC.end())
result[i] += *it;
}
or to store each member of the class in a vector like so:
Class of vectors
vector<double> classCounts;
vector<double> classVal;
...
vector<vector<double> > classSumVectorC;
...
and then operate as:
for(int i = 0; i < classes.size(); i++)
{
result[i] += classCounts[i];
...
}
Which way would usually be faster (across x86/x64 platforms and compilers)? Are look-ahead and cache lines the most important things to think about here?
Update
The reason I'm doing a linear search (i.e. find) here and not a hash map or binary search is because the sumVectors are very short, around 4 or 5 elements. Profiling showed a hash map was slower and a binary search was slightly slower.
As the implementation of both variants seems easy enough I would build both versions and profile them to find the fastest one.
Empirical data usually beats speculation.
As a side issue: Currently, the find() in your innermost loop does a linear scan through all elements of classes[i].sumVectorC until it finds a matching value. If that vector contains many values, and you have no reason to believe that testVal appears near the start of the vector, then this will be slow -- consider using a container type with faster lookup instead (e.g. std::map or one of the nonstandard but commonly implemented hash_map types).
As a general guideline: consider algorithmic improvements before low-level implementation optimisation.
As lothar says, you really should test it out. But to answer your last question, yes, cache misses will be a major concern here.
Also, it seems that your first implementation would run into load-hit-store stalls as coded, but I'm not sure how much of a problem that is on x86 (it's a big problem on XBox 360 and PS3).
It looks like optimizing the find() would be a big win (profile to know for sure). Depending on the various sizes, in addition to replacing the vector with another container, you could try sorting sumVectorC and using a binary search in the form of lower_bound. This will turn your linear search O(n) into O(log n).
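A minimal sketch of the lower_bound variant, assuming sumVectorC is kept sorted in ascending order. find_or_zero is a hypothetical name; it returns 0.0 when the value is absent so that the caller can add the result unconditionally.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the element equal to testval, or 0.0 if there is none.
// Precondition: `sorted` is in ascending order.
double find_or_zero(const std::vector<double>& sorted, double testval)
{
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    // lower_bound returns the first element that is not less than testval.
    const auto it = std::lower_bound(sorted.begin(), sorted.end(), testval);
    return (it != sorted.end() && *it == testval) ? *it : 0.0;
}
```

In the loop from the question this would read result[i] += find_or_zero(classes[i].sumVectorC, testval). For vectors of 4 or 5 elements the asker's own profiling found binary search slightly slower than the linear find, so this only pays off for longer vectors.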
If you can guarantee that std::numeric_limits<double>::infinity() is not a possible value, you could keep the arrays sorted with a dummy infinite entry at the end and then hand-code the find so that the loop condition is a single test:
array[i]<test_val
followed by one equality test.
Then the average number of values examined is (size()+1)/2 in the not-found case. Of course, if the search array changes very frequently, then keeping it sorted becomes a cost of its own.
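The sentinel search described above might be sketched as follows. sentinel_find is a hypothetical name; it assumes the vector is sorted ascending, ends with +infinity, and that the value searched for is finite.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Returns the index of testval, or the index of the sentinel
// (sorted.size() - 1) if testval is absent.
// Preconditions: ascending order, last element is +infinity, testval finite.
std::size_t sentinel_find(const std::vector<double>& sorted, double testval)
{
    assert(!sorted.empty() &&
           sorted.back() == std::numeric_limits<double>::infinity());
    std::size_t i = 0;
    // Single test per element; the sentinel guarantees the loop stops
    // without a separate bounds check.
    while (sorted[i] < testval) ++i;
    // One equality test after the loop decides found versus not found.
    return sorted[i] == testval ? i : sorted.size() - 1;
}
```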
Of course, you don't tell us much about sumVectorC or the rest of A for that matter, so it is hard to give really good advice. For example, if sumVectorC is never updated, then it is probably possible to find an EXTREMELY cheap hash (e.g. a cast to ULL and bit extraction) that is perfect on the sumVectorC values and fits into a double[8]. Then the overhead is a bit extraction and 1 comparison versus 3 or 6.
Also, if you have a reasonable bound on sumVectorC.size() (you mentioned 4 or 5, so this assumption seems not bad), you could consider using an aggregated array, or even just a boost::array with a fixed capacity to which you add your own dynamic size, e.g.:
class AggregatedArray : public boost::array<double, 8>{ // 8 is an assumed upper bound
size_t _size;
public:
size_t size() const {
return _size;
}
....
push_back(..){...
pop(){...
resize(...){...
};
This gets rid of the extra cache line access to the separately allocated array data for sumVectorC.
If sumVectorC is updated very infrequently and finding a perfect hash (out of your class of hash algorithms) is relatively cheap, then you can pay that cost with profit whenever sumVectorC changes. These small lookups can be problematic, and algorithmic complexity is frequently irrelevant: it is the constants that dominate. It is an engineering problem and not a theoretical one.
Unless you can guarantee that the small maps are in cache, you can almost be guaranteed that using a std::map will yield approximately 170% worse performance, as pretty much each node in the tree will be in a separate cache line.
Thus, instead of accessing (4 times 1 + 1 times 2)/5 = 1.2 cache lines per search (the first 4 elements are in the first cache line, the 5th in the second), you will access (1 times 1 + 2 times 2 + 2 times 3)/5 = 11/5 = 2.2, plus 1 for the tree itself, which makes 3.2 cache lines per search (1 node at the root, 2 nodes that are children of the root, 2 that are grandchildren of the root, plus the tree object itself).
So I would predict that using a std::map takes 3.2/1.2 = 267% as long for a sumVectorC having 5 entries.
This is what I meant when I said: "It is an engineering problem and not a theoretical one."