std::transform slower than for loops

std::transform slower than for loops - c++

I thought about implementing a matrix class that used std::transform from algorithm for calculation but I came across that in some situations it's faster to write loops.
Having a look add operator+= for element wise add. In case the rhs matrix has 1 col while having the same number of rows than the lhs matrix I can do the following:
for (auto c = 0; c < cols(); ++c) {
std::transform(std::execution::par, col_begin(c), col_end(c), rhs.begin(), col_begin(c), std::plus<>());
}
or use simple loops:
auto lhsval = begin();
auto rhsval= rhs.begin();
for (auto r = 0; r < rows(); ++r) {
for (auto c = 0; c < cols(); ++c) {
*lhsval += *rhsval;
++lhsval;
}
++rhsval;
}
For your information, i wrote an iterator that accepts a step. So the col_begin() returns an iterator that will skip other columns in the operator++
I timed the difference between both implementations using google benchmark and came to the conclusion that the loop is about 5 times faster than using std::transform. Well maybe there should be a difference, but not a difference that huge.
You can look at the complete code at my github repo
matrix class
matrix iterator

Passing std::execution::par is asking the library to parallelize this operation. This adds overhead, even if it is just to determine "your problem is too small to parallelize". The number of elements being transformed has to be quite large (sometimes hundreds of thousands or millions) before the parallelization is worthwhile, and requires that you have appropriate hardware (parallelizing on a two-core machine is much less likely to be worth it than on a 64-core machine).
The for loop version is much more similar to plain std::transform without the std::execution::par parameter. If you remove that parameter and the performance difference is still large, please update your question with that information, alongside your compiler version, platform, compiler switches and information about your data set: number of rows/columns, etc.

Related

Which c++ implementation is preferable, range based loop or count_if

I read the question which implementation is preferable to perform a count of some vector items.
Is this better than
auto countif = [] (T t) { return t.countable(); };
const int count = std::count_if(v.begin(), v.end(), countif);
return count ;
this
int count = 0;
for ( auto& t : v )
if (t.countable()) count++;
The question has been voted down and thus been deleted.

You should almost always use an algorithm like std::count_if if one is available.
The reason is that the compiler vendor can put optimizations in that are not portable if you were to put them in manually in your own loop. For example there are intrinsic functions that could be CPU specific that speed up even basic tasks like counting values in an array.
Unless you have a specific need to use non-portable optimizations then algorithms provided by the compiler in the standard library are likely to be faster in a portable way than something you are likely to write.

How to use algorithms to fill vector of vectors

I have
typedef std::vector<int> IVec;
typedef std::vector<IVec> IMat;
and I would like to know how I can fill an IMat by using std algorithms, ie how to do the following with less code (all the IVecs have the same size) ?
void fill(IMat& mat){
for (int i=0;i<mat.size();i++){
for (int j=0;j<mat[i].size();j++){
mat[i][j] = i*j;
}
}
}
PS: already a way to fill the matrix with a constant would help me. And preferably with pre-C++11 algorithms.

The best solution is the one that you have already implemented. It takes advantage of using i/j as both offsets and as inputs to compute the algorithm.
Standard algorithms will have to use iterators for the elements and maintain counters. This data mirroring as a sure sign of a problem. But it can be done, even on one line if you wanna be fancy:
for_each(mat.begin(), mat.end(), [&](auto& i) { static auto row = 0; auto column = 0; generate(i.begin(), i.end(), [&]() { return row * column++; }); ++row; });
But as stated just cause it could be done doesn't mean that it should be done. The best way to approach this is the for-loop. Even doing it on one line is possible if that's your thing:
for(auto i = 0U;i < mat.size();i++) for(auto j = 0U;j < mat[i].size();j++) mat[i][j] = i*j;
Incidentally my standard algorithm works fine on Clang 3.7.0, gcc 5.1, and on Visual Studio 2015. However previously I used transform rather than generate. And there seem to be some implementation bugs in gcc 5.1 and Visual Studio 2015 with the captures of lambda scope static variables.

I don't know if this is better than a double for loop, but one possible way you could do it using STL in C++11 would be using two for_each as follows:
int i(0);
std::for_each(mat.begin(), mat.end(),
[&i](IVec &ivec){int j(0); std::for_each(ivec.begin(), ivec.end(),
[&i,&j](auto &k){k = i*j++;}); ++i;});
LIVE DEMO

Just thought I'd comment further on Jonathan's excellent answer.
Ignore the c++11 syntax for now and imagine that we had written some supporting classes (doesn't matter how for now).
we could conceivably come up with code like this:
auto main() -> int
{
// define a matrix (vector of vectors)
IMat mat;
// resize it through some previously defined function
resize(mat, 10, 10);
// get an object that is a pseudo-container representing its extent
auto extent = extent_of(mat);
// generate values in the pseudo-container which forwards to the matrix
std::generate(extent.begin(),
extent.end(),
[](auto pxy) { pxy.set_value(pxy.x * pxy.y); });
// or even
for (auto pxy : extent_of(mat)) {
pxy.set_value(product(pxy.coordinates()));
}
return 0;
}
100 lines of supporting code later (iterable containers and their proxies are not trivial) and this would compile and work.
Clever as it undoubtedly would be, there are some problems:
There's the small matter of the 100 extra lines of code.
It seems to me that this code is actually less expressive than yours. i.e. it's immediately obvious what your code is doing. With mine you have to make some assumptions or go and reason about the extra 100 lines of code.
my code needs a lot more maintenance (and documentation) than yours
Sometimes less is more.

Swap columns in c++

I have an std matrix defined as:
std::vector<std::vector<double> > Qe(6,std::vector<double>(6));
and a vector v that is:
v{0, 1, 3, 2, 4, 5};
I would like to swap the columns 3 and 2 of matrix Qe like indicated in vector v.
In Matlab this is as easy as writing Qe=Qe(:,v);
I wonder if there is an easy way other than a for loop to do this in c++.
Thanks in advance.

Given that you've implemented this as a vector of vectors, you can use a simple swap:
std::swap(Qe[2], Qe[3]);
This should have constant complexity. Of course, this will depend on whether you're treating your data as column-major or row-major. If you're going to be swapping columns often, however, you'll want to arrange the data to suit that (i.e., to allow the code above to work).
As far as doing the job without a for loop when you're using row-major ordering (the usual for C++), you can technically eliminate the for loop (at least from your source code) by using a standard algorithm instead:
std::for_each(Qe.begin(), Qe.end(), [](std::vector<double> &v) {std::swap(v[2], v[3]); });
This doesn't really change what's actually happening though--it just hides the for loop itself inside a standard algorithm. In this case, I'd probably prefer a range-based for loop:
for (auto &v : Qe)
std::swap(v[2], v[3]);
...but I've never been particularly fond of std::for_each, and when C++11 added range-based for loops, I think that was a superior alternative to the vast majority of cases where std::for_each might previously have been a reasonable possibility (IOW, I've never seen much use for std::for_each, and see almost none now).

Depends on how you implement your matrix.
If you have a vector of columns, you can swap the column references. O(1)
If you have a vector of rows, you need to swap the elements inside each row using a for loop. O(n)
std::vector<std::vector<double>> can be used as a matrix but you also need to define for yourself whether it is a vector of columns or vector of rows.
You can create a function for this so you don't write a for loop each time. For example, you can write a function which receives a matrix which is a vector of columns and a reordering vector (like v) and based on the reordering vector you create a new matrix.
//untested code and inefficient, just an example:
vector<vector<double>> ReorderColumns(vector<vector<double>> A, vector<int> order)
{
vector<vector<double>> B;
for (int i=0; i<order.size(); i++)
{
B[i] = A[order[i]];
}
return B;
}
Edit: If you want to do linear algebra there are libraries that can help you, you don't need to write everything yourself. There are math libraries for other purposes too.

If you are in a row scenario. The following would probably work:
// To be tested
std::vector<std::vector<double> >::iterator it;
for (it = Qe.begin(); it != Qe.end(); ++it)
{
std::swap((it->second)[2], (it->second)[3]);
}
In this scenario I don't see any other solution that would avoid doing a loop O(n).

Which one is more optimized for accessing array?

Solving the following exercise:
Write three different versions of a program to print the elements of
ia. One version should use a range for to manage the iteration, the
other two should use an ordinary for loop in one case using subscripts
and in the other using pointers. In all three programs write all the
types directly. That is, do not use a type alias, auto, or decltype to
simplify the code.[C++ Primer]
a question came up: Which of these methods for accessing array is optimized in terms of speed and why?
My Solutions:
Foreach Loop:
int ia[3][4]={{1,2,3,4},{5,6,7,8},{9,10,11,12}};
for (int (&i)[4]:ia) //1st method using for each loop
for(int j:i)
cout<<j<<" ";
Nested for loops:
for (int i=0;i<3;i++) //2nd method normal for loop
for(int j=0;j<4;j++)
cout<<ia[i][j]<<" ";
Using pointers:
int (*i)[4]=ia;
for(int t=0;t<3;i++,t++){ //3rd method. using pointers.
for(int x=0;x<4;x++)
cout<<(*i)[x]<<" ";
Using auto:
for(auto &i:ia) //4th one using auto but I think it is similar to 1st.
for(auto j:i)
cout<<j<<" ";
Benchmark result using clock()
1st: 3.6 (6,4,4,3,2,3)
2nd: 3.3 (6,3,4,2,3,2)
3rd: 3.1 (4,2,4,2,3,4)
4th: 3.6 (4,2,4,5,3,4)
Simulating each method 1000 times:
1st: 2.29375 2nd: 2.17592 3rd: 2.14383 4th: 2.33333
Process returned 0 (0x0) execution time : 13.568 s
Compiler used:MingW 3.2 c++11 flag enabled. IDE:CodeBlocks

I have some observations and points to make and I hope you get your answer from this.
The fourth version, as you mention yourself, is basically the same as the first version. auto can be thought of as only a coding shortcut (this is of course not strictly true, as using auto can result in getting different types than you'd expected and therefore result in different runtime behavior. But most of the time this is true.)
Your solution using pointers is probably not what people mean when they say that they are using pointers! One solution might be something like this:
for (int i = 0, *p = &(ia[0][0]); i < 3 * 4; ++i, ++p)
cout << *p << " ";
or to use two nested loops (which is probably pointless):
for (int i = 0, *p = &(ia[0][0]); i < 3; ++i)
for (int j = 0; j < 4; ++j, ++p)
cout << *p << " ";
from now on, I'm assuming this is the pointer solution you've written.
In such a trivial case as this, the part that will absolutely dominate your running time is the cout. The time spent in bookkeeping and checks for the loop(s) will be completely negligible comparing to doing I/O. Therefore, it won't matter which loop technique you use.
Modern compilers are great at optimizing such ubiquitous tasks and access patterns (iterating over an array.) Therefore, chances are that all these methods will generate exactly the same code (with the possible exception of the pointer version, which I will talk about later.)
The performance of most codes like this will depend more on the memory access pattern rather than how exactly the compiler generates the assembly branch instructions (and the rest of the operations.) This is because if a required memory block is not in the CPU cache, it's going to take a time roughly equivalent of several hundred CPU cycles (this is just a ballpark number) to fetch those bytes from RAM. Since all the examples access memory in exactly the same order, their behavior in respect to memory and cache will be the same and will have roughly the same running time.
As a side note, the way these examples access memory is the best way for it to be accessed! Linear, consecutive and from start to finish. Again, there are problems with the cout in there, which can be a very complicated operation and even call into the OS on every invocation, which might result, among other things, an almost complete deletion (eviction) of everything useful from the CPU cache.
On 32-bit systems and programs, the size of an int and a pointer are usually equal (both are 32 bits!) Which means that it doesn't matter much whether you pass around and use index values or pointers into arrays. On 64-bit systems however, a pointer is 64 bits but an int will still usually be 32 bits. This suggests that it is usually better to use indexes into arrays instead of pointers (or even iterators) on 64-bit systems and programs.
In this particular example, this is not significant at all though.
Your code is very specific and simple, but the general case, it is almost always better to give as much information to the compiler about your code as possible. This means that you must use the narrowest, most specific device available to you to do a job. This in turn means that a generic for loop (i.e. for (int i = 0; i < n; ++i)) is worse than a range-based for loop (i.e. for (auto i : v)) for the compiler, because in the latter case the compiler simply knows that you are going to iterate over the whole range and not go outside of it or break out of the loop or something, while in the generic for loop case, specially if your code is more complex, the compiler cannot be sure of this and has to insert extra checks and tests to make sure the code executes as the C++ standard says it should.
In many (most?) cases, although you might think performance matters, it does not. And most of the time you rewrite something to gain performance, you don't gain much. And most of the time the performance gain you get is not worth the loss in readability and maintainability that you sustain. So, design your code and data structures right (and keep performance in mind) but avoid this kind of "micro-optimization" because it's almost always not worth it and even harms the quality of the code too.
Generally, performance in terms of speed is very hard to reason about. Ideally you have to measure the time with real data on real hardware in real working conditions using sound scientific measuring and statistical methods. Even measuring the time it takes a piece of code to run is not at all trivial. Measuring performance is hard, and reasoning about it is harder, but these days it is the only way of recognizing bottlenecks and optimizing the code.
I hope I have answered your question.
EDIT: I wrote a very simple benchmark for what you are trying to do. The code is here. It's written for Windows and should be compilable on Visual Studio 2012 (because of the range-based for loops.) And here are the timing results:
Simple iteration (nested loops): min:0.002140, avg:0.002160, max:0.002739
Simple iteration (one loop): min:0.002140, avg:0.002160, max:0.002625
Pointer iteration (one loop): min:0.002140, avg:0.002160, max:0.003149
Range-based for (nested loops): min:0.002140, avg:0.002159, max:0.002862
Range(const ref)(nested loops): min:0.002140, avg:0.002155, max:0.002906
The relevant numbers are the "min" times (over 2000 runs of each test, for 1000x1000 arrays.) As you see, there is absolutely no difference between the tests. Note that you should turn on compiler optimizations or test 2 will be a disaster and cases 4 and 5 will be a little worse than 1 and 3.
And here are the code for the tests:
// 1. Simple iteration (nested loops)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows; ++i)
for (unsigned j = 0; j < gc_Cols; ++j)
sum += g_Data[i][j];
// 2. Simple iteration (one loop)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
sum += g_Data[i / gc_Cols][i % gc_Cols];
// 3. Pointer iteration (one loop)
unsigned sum = 0;
unsigned * p = &(g_Data[0][0]);
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
sum += *p++;
// 4. Range-based for (nested loops)
unsigned sum = 0;
for (auto & i : g_Data)
for (auto j : i)
sum += j;
// 5. Range(const ref)(nested loops)
unsigned sum = 0;
for (auto const & i : g_Data)
for (auto const & j : i)
sum += j;

It has many factors affecting it:
It depends on the compiler
It depends on the compiler flags used
It depends on the computer used
There is only one way to know the exact answer: measuring the time used when dealing with huge arrays (maybe from a random number generator) which is the same method you have already done except that the array size should be at least 1000x1000.

How to use std::accumulate to neatly sum values in a vector pointed by separately defined indices (replacing loops)

I was wondering if there's a neater (or better yet, more efficient), method of summing values of a vector/(asymmetric) matrix (a matrix having structure like symmetry, could of course be exploited in looping, but not that pertinent to my question) pointed by a collection of indices. Basically this code could be used to calculate, say, a cost of a route through a 2D matrix. I'm looking for a way to utilize CPU, not GPU.
Here's some relevant code, the one I'm more interested is the first case. I was thinking it's possible to use std::accumulate with a lambda to capture the indices vector, but then I got wondering, if there's already a neater way, perhaps with some other operator. Not a "real problem" as looping is quite clear for my tastes too, but in hunt for the super-neat or more efficient on-liner...
template<typename out_type>
out_type sum(std::vector<float> const& matrix, std::vector<int> const& indices)
{
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; ++i)
{
const int index = indices.size() * indices[i] + indices[i + 1];
cost += matrix[index];
}
const int index = indices.size() * indices[indices.size() - 1] + indices[0];
cost += matrix[index];
return cost;
}
template<typename out_type>
out_type sum(std::vector<std::vector<float>> const& matrix, std::vector<int> const& indices)
{
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
{
cost += matrix[indices[i]][indices[i + 1]];
}
cost += matrix[indices[indices.size() - 1]][indices[0]];
return cost;
}
Oh, and PPL/TBB are fair game too.
Edit
As an afterthought and as commented to John, would there be a place to employ std::common_type in the calculation as the input and output types may differ? This is a bit of hand-waving and more like learning techniques and libraries. A form of code kata, if you will.
Edit 2
Now, there's one option to make the loops faster, explained in blog writing How to process a STL vector using SSE code by a blogger theowl84. The code uses __m128 directly, but I wonder if there's something in DirectXMath library too.
Edit 3
Now, after writing some concrete code, I found std::accumulate wouldn't get me far. Or at least I couldn't find a way to do the [indices[i + 1] part in matrix[indices[i]][indices[i + 1]]; in a neat way, as std::accumulate itself gives access to only the current value and the sum. In that light, it looks like novelocrat's approach would be the most fruitful one.
DeadMG proposed using parallel_reduce with associativity caveats, further commented by novelocrat. I didn't go about seeing if I could use parallel_reduce, as the interface looked somewhat cumbersome for quick trying. Other than that, even though my code executes serially, it would suffer from the same floating some issues as the parallel reduction version. Though the parallel version would/could be (much) more unpredictable with than serial version, I think.
This goes somewhat tangential, but it may be of interest to some stumbling here, and to those of whom have read this far, may be (very) interested on article Wandering Precision in The NAG blog, which details some intricanciens even introduced by hardware instruction re-ordering! Then there are some ruminations about this very issue in distributed setting in #AltDevBlogADay Synchronous RTS Engines and a Tale of Desyncs. Also, ACCU (the general mailing list is excellent, by the way, and it's free to join) features several articles (e.g. this) on floating point accuracy. A tangential to tangential, I found Fernando Cacciola's Robustness issues in geometric computing to be a good article to read, originally from ACCU mailing list.
And then then the std::common_type. I couldn't find usage for that. If I had two different types as parameters, then the return value could/should be decided by std::common_type. Perhaps more pertinent is std::is_convertible with static_assert to make sure the desired result type is convertible from the argument types (with a clean error message). Other than that, I can only make up a check that the return value/intermediate calculation value accurracy is sufficient to represent the result of summation without overflows and things like that, but I haven't come across a standard facility for that.
That about that, I think, ladies and gentlemen. I enjoyed myself, I hope those reading this got something out of this too.

You could produce an iterator that takes matrix and indices and yields the appropriate values.
class route_iterator
{
vector<vector<float>> const& matrix;
vector<int> const& indices;
int i;
public:
route_iterator(vector<vector<float>> const& matrix_, vector<int> const& indices_,
int begin = 0)
: matrix(matrix_), indices(indices_), i(begin)
{ }
float operator*() {
return matrix[indices[i]][indices[(i + 1) % indices.size()]];
}
route_iterator& operator++() {
++i;
return *this;
}
};
Then your accumulate runs from route_iterator(matrix, indices) to route_iterator(matrix, indices, indices.size()).
Admittedly, though, this sequentializes without a smart compiler turning it into something parallel. What you really want are parallel map and fold (accumulate) operations.

out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
{
cost += matrix[indices[i]][indices[i + 1]];
}
This is basically std::accumulate. PPL provides (and so does TBB, if I recall) parallel_reduce. This requires associativity but not commutivity, and + over the real/float/integer is associative.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

std::transform slower than for loops - c++

Related

Which c++ implementation is preferable, range based loop or count_if

How to use algorithms to fill vector of vectors

Swap columns in c++

Which one is more optimized for accessing array?

How to use std::accumulate to neatly sum values in a vector pointed by separately defined indices (replacing loops)

Categories

Resources