Safe and fast FFT - c++

Inspired by Herb Sutter's compelling lecture Not your father's C++, I decided to take another look at the latest version of C++ using Microsoft's Visual Studio 2010. I was particularly interested in Herb's assertion that C++ is "safe and fast" because I write a lot of performance-critical code.
As a benchmark, I decided to try to write the same simple FFT algorithm in a variety of languages.
I came up with the following C++11 code that uses the built-in complex type and vector collection:
#include <cmath>     // for atan, cos, sin
#include <complex>
#include <vector>

using namespace std;

// Must provide type or MSVC++ barfs with "ambiguous call to overloaded function"
double pi = 4 * atan(1.0);

void fft(int sign, vector<complex<double>> &zs) {
    unsigned int j = 0;
    // Warning about signed vs unsigned comparison
    for (unsigned int i = 0; i < zs.size() - 1; ++i) {
        if (i < j) {
            auto t = zs.at(i);
            zs.at(i) = zs.at(j);
            zs.at(j) = t;
        }
        int m = zs.size() / 2;
        j ^= m;
        while ((j & m) == 0) { m /= 2; j ^= m; }
    }
    for (unsigned int j = 1; j < zs.size(); j *= 2)
        for (unsigned int m = 0; m < j; ++m) {
            auto t = pi * sign * m / j;
            auto w = complex<double>(cos(t), sin(t));
            for (unsigned int i = m; i < zs.size(); i += 2 * j) {
                complex<double> zi = zs.at(i), t = w * zs.at(i + j);
                zs.at(i) = zi + t;
                zs.at(i + j) = zi - t;
            }
        }
}
Note that this function only works for n-element vectors where n is an integral power of two. Anyone looking for fast FFT code that works for any n should look at FFTW.
As I understand it, the traditional xs[i] syntax from C for indexing a vector does not do bounds checking and, consequently, is not memory safe and can be a source of memory errors such as non-deterministic corruption and memory access violations. So I used xs.at(i) instead.
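For example, here's a tiny sketch (not part of the FFT itself) of the difference I mean: at() checks the index and throws, while [] just trusts it:
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> xs(3);
    // xs[10] would be undefined behaviour: no check, possibly silent corruption
    try {
        std::cout << xs.at(10) << "\n";   // .at() checks the index...
    } catch (const std::out_of_range &e) {
        std::cout << "caught: " << e.what() << "\n";   // ...and throws on a bad one
    }
}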
Now, I want this code to be "safe and fast" but I am not a C++11 expert so I'd like to ask for improvements to this code that would make it more idiomatic or efficient?

I think you are being overly "safe" in your use of at(). In most of your cases the index used is trivially verifiable as being constrained by the size of the container in the for loop.
e.g.
for(unsigned int i=0; i<zs.size()-1; ++i) {
...
auto t = zs.at(i);
The only ones I'd leave as at()s are the (i + j)s. It's not immediately obvious whether they would always be constrained (although if I was really unsure I'd probably check manually - but I'm not familiar enough with FFTs to have an opinion in this case).
There are also some fixed computations being repeated for each loop iteration:
int m=zs.size()/2;
pi * sign
2*j
And the zs.at(i + j) is computed twice.
It's possible that the optimiser may catch these - but if you are treating this as performance critical, and you have your timers testing it, I'd hoist them out of the loops (or, in the case of zs.at(i + j), just take a reference on first use) and see if that impacts the timer.
Talking of second-guessing the optimiser: I'm sure that the calls to .size() will be inlined to, at worst, a direct read of an internal member variable - but given how many times you call it I'd also experiment with introducing local variables for zs.size() and zs.size()-1 upfront. They're more likely to be put into registers that way too.
I don't know how much of a difference (if any) all of this will have on your total runtime - some of it may already be caught by the optimiser, and the differences may be small compared to the computations involved - but worth a shot.
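To make the shape of that concrete, here's a rough sketch of the hoisted second block (untested, and it reuses the headers and pi from the question's code; base, step and the reference hi are just my names for the hoisted values):
void fft_hoisted(int sign, vector<complex<double>> &zs) {
    const size_t n = zs.size();        // .size() read once
    // ... bit-reversal loop as before, using n and a hoisted n / 2 ...
    const double base = pi * sign;     // constant for the whole call
    for (size_t j = 1; j < n; j *= 2) {
        const size_t step = 2 * j;     // constant for this value of j
        for (size_t m = 0; m < j; ++m) {
            const auto t = base * m / j;
            const auto w = complex<double>(cos(t), sin(t));
            for (size_t i = m; i < n; i += step) {
                complex<double> &hi = zs.at(i + j);               // looked up once, reused
                const complex<double> zi = zs.at(i), wt = w * hi;
                zs.at(i) = zi + wt;
                hi = zi - wt;
            }
        }
    }
}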
As for being idiomatic my only comment, really, is that size() returns a std::size_t (an unsigned integer type) - it's more idiomatic to use that type for the loop variable than unsigned int. If you did want to use auto but avoid the warning you could try adding the ul suffix to the 0 - not sure I'd say that is idiomatic, though. I suppose you're already less than idiomatic in not using iterators here, but I can see why you can't do that (easily).
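That is, something along these lines:
for (std::size_t i = 0; i < zs.size() - 1; ++i) { /* ... */ }
// or, keeping auto and dodging the signed/unsigned warning:
for (auto i = 0ul; i < zs.size() - 1; ++i) { /* ... */ }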
Update
I gave all my suggestions a try and they all had a measurable performance improvement - except the i+j and 2*j precalcs - they actually caused a slight slowdown! I presume they either prevented a compiler optimisation or prevented it from using registers for some things.
Overall I got a >10% perf. improvement with those suggestions.
I suspect more could be had if the second block of loops was refactored a little to avoid the jumps - and, having done so, enabling the SSE2 instruction set may give a significant boost (I did try it as is and saw a slight slowdown).
I think that refactoring, along with using something like MKL for the cos and sin calls should give greater, and less brittle, improvements. And neither of those things would be language dependent (I know this was originally being compared to an F# implementation).
Update 2
I forgot to mention that pre-calculating zs.size() did make a difference.
Update 3
Also forgot to say (until reminded by #xeo in a comment to the OP) that the block guarded by the i < j check can be boiled down to a std::swap. This is more idiomatic and at least as performant - in the worst case it should inline to the same code as written. Indeed, when I did it I saw no change in the performance. In other cases it can lead to a performance gain if move constructors are available.
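That is, the swap block simply becomes (std::swap lives in <utility> in C++11):
if (i < j)
    std::swap(zs.at(i), zs.at(j));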

Related

What's the common strategy to optimize c++ arithmetic computation for arrays?

For example, I have three float arrays, a, b and c, and I want to add a and b element-wise into c. A naive way is like
for(int i = 0; i < n; i++){
    c[i] = a[i] + b[i];
}
As far as I know, OpenMP can parallelize this piece of code. In OpenCV code, I see some flags like CV_SSE2 and CV_NEON which are related to optimization.
What's the common way to optimize these kinds of code, if I want my code highly efficient?
There is no common strategy. You should be sure that it is a bottleneck (which it might not be, if the size n of your arrays is small enough).
Some compilers are able to optimize that (at least in some simple cases) by using vector machine instructions. With GCC, try compiling with gcc -O3 -mtune=native (or other -mtune=... or -mfpu=... arguments, in particular if you are cross-compiling) and possibly -ffast-math.
You could consider OpenMP, OpenCL (with a GPGPU), OpenACC, MPI, explicit threading with e.g. pthreads or C++11 std::thread, etc. (and a clever mix of several approaches).
I would leave the optimization to the compiler, and only consider improving it if you measure that it is a bottleneck. You could spend months or years of developer time improving it (or even specialize in that for your whole working life).
You could also use some numerical computation library (e.g. LAPACK, GSL, etc...) or specialized software like Scilab, Octave, R, etc...
Read also http://floating-point-gui.de/
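For instance, the OpenMP route is often just a pragma on the loop (a sketch, assuming you build with an OpenMP-enabled compiler, e.g. gcc -fopenmp; add_arrays is just an illustrative name):
#include <cstddef>

void add_arrays(const float *a, const float *b, float *c, std::size_t n) {
    // Each iteration is independent, so the loop can be split across threads.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i)
        c[i] = a[i] + b[i];
}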
You should continue looking into parallel options. But for single-threaded, it's generally faster to do it like this:
int i = 0;
for (; i < n - 3; i += 4) {
    c[i] = a[i] + b[i];
    c[i + 1] = a[i + 1] + b[i + 1];
    c[i + 2] = a[i + 2] + b[i + 2];
    c[i + 3] = a[i + 3] + b[i + 3];
}
for (; i < n; i++) {
    c[i] = a[i] + b[i];
}
Sometimes unrolling can be done by the compiler, but at least in my experience (I use MSC), the compiler typically doesn't perform partial unrolling like this on its own, and doing it by hand can sometimes help. It can be beneficial when each of the 4 statements inside the loop can be pipelined and run in parallel, and it saves comparisons/jumps.
So I would use this as a starting point, and measure it. Then, only apply the parallelization if you measure a gain over this. Or, if you make your threads by hand, each thread should probably do the unrolled variant.
Update: I'm not personally seeing any gain from this. I think it's because inside the unrolled loop, a full 12 floats are accessed. And the float operations are likely slow enough to negate any savings from the jge/cmp operations that are eliminated by unrolling it.
Still, whenever you have a similar problem with lighter, independent operations, I'd recommend at least trying this: unrolling it in the code generates clearly different assembly, gives you different performance characteristics, and reduces the number of cmp/jmp instructions by a factor of 4, which can help. Here, though, I think the floating point operations are just too significant for it to matter.
As already mentioned by others, there is no "common strategy"; it really depends on your particular use case: Are the arrays very large? Are they rather small but you have to call this function very frequently? These are questions you will have to ask yourself. And before trying to optimize anything, you should always profile your code. In most applications more than 90% of the time is spent in less than 10% of the code. Unless you know exactly where to find that 10%, optimizing other parts of the application can have little to no effect.
However, when it comes to arithmetic computations, I think it is always a good start to rely on the optimized standard algorithms. When concerned about efficiency, I would add the two arrays (after putting a and b in a std::vector or std::array and preallocating c) via
std::transform(a.begin(), a.end(), b.begin(), c.begin(), std::plus<float>());
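Put together, that approach might look roughly like this (a sketch; the sizes and values are arbitrary, and note that c must be preallocated to the right size):
#include <algorithm>
#include <functional>
#include <vector>

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f);
    std::vector<float> c(a.size());   // preallocate the destination
    // Element-wise c[i] = a[i] + b[i] via the standard algorithm.
    std::transform(a.begin(), a.end(), b.begin(), c.begin(), std::plus<float>());
    return 0;
}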
Depending on your compiler's optimization stage, an array index a[i] may be slower than a pointer dereference *p (with p incremented on each iteration, so that p = a + i).
So without relying on the optimizer this may be faster with some compilers:
float* pa = a;
float* pb = b;
float* pc = c;
for(int i = 0; i < n; i++)
    *pc++ = *pa++ + *pb++;
While it may seem trivial in this case, this basic technique can result in large gains in more complicated cases where things are too involved for the optimizer to do the job.

The simple task of iterating through an array. Which of these solutions is the most efficient?

Recently, I've been thinking about all the ways that one could iterate through an array and wondered which of these is the most (and least) efficient. I've written a hypothetical problem and five possible solutions.
Problem
Given an int array arr with len number of elements, what would be the most efficient way of assigning an arbitrary number 42 to every element?
Solution 0: The Obvious
for (unsigned i = 0; i < len; ++i)
    arr[i] = 42;
Solution 1: The Obvious in Reverse
for (int i = len - 1; i >= 0; --i)   // i must be signed here: with an unsigned i, i >= 0 is always true
    arr[i] = 42;
Solution 2: Address and Iterator
for (unsigned i = 0; i < len; ++i)
{
    *arr = 42;
    ++arr;
}
Solution 3: Address and Iterator in Reverse
for (unsigned i = len; i; --i)
{
    *arr = 42;
    ++arr;
}
Solution 4: Address Madness
int* end = arr + len;
for (; arr < end; ++arr)
    *arr = 42;
Conjecture
The obvious solutions are almost always used, but I wonder whether the subscript operator could result in a multiplication instruction, as if it had been written like *(arr + i * sizeof(int)) = 42.
The reverse solutions try to take advantage of how comparing i to 0 instead of len might mitigate a subtraction operation. Because of this, I prefer Solution 3 over Solution 2. Also, I've read that arrays are optimized to be accessed forwards because of how they're stored in the cache, which could present an issue with Solution 1.
I don't see why Solution 4 would be any less efficient than Solution 2. Solution 2 increments the address and the iterator, while Solution 4 only increments the address.
In the end, I'm not sure which of these solutions I prefer. I think the answer also varies with the target architecture and the optimization settings of your compiler.
Which of these do you prefer, if any?
Just use std::fill.
std::fill(arr, arr + len, 42);
Out of your proposed solutions, on a good compiler, none should be faster than the others.
The ISO standard doesn't mandate the efficiency of the different ways of doing things in code (other than certain big-O type stuff for some collection algorithms), it simply mandates how it functions.
Unless your arrays are billions of elements in size, or you're wanting to set them millions of times per minute, it generally won't make the slightest difference which method you use.
If you really want to know (and I still maintain it's almost certainly unnecessary), you should benchmark the various methods in the target environment. Measure, don't guess!
As to which I prefer, my first inclination is to optimise for readability. Only if there's a specific performance problem do I then consider other possibilities. That would be simply something like:
for (size_t idx = 0; idx < len; idx++)
arr[idx] = 42;
I don't think that performance is an issue here - these are, if they make any difference at all (I could imagine the compiler producing identical assembly for most of them), micro-optimizations that are hardly ever necessary.
Go with the solution that is most readable; the standard library provides you with std::fill, or for more complex assignments
for(unsigned k = 0; k < len; ++k)
{
// whatever
}
so it is obvious to other people looking at your code what you are doing. With C++11 you could also
for(auto & elem : arr)
{
// whatever
}
just don't try to obfuscate your code without any necessity.
For nearly all meaningful cases, the compiler will optimize all of the suggested ones to the same thing, and it's very unlikely to make any difference.
There used to be a trick where you could avoid the automatic prefetching of data if you ran the loop backwards, which under some bizarre set of circumstances actually made it more efficient. I can't recall the exact circumstances, but I expect modern processors will identify backwards loops as well as forwards loops for automatic prefetching anyway.
If it's REALLY important for your application to do this over a large number of elements, then looking at blocked access and using non-temporal storage will be the most efficient. But before you do that, make sure you have identified the filling of the array as an important performance point, and then make measurements for the current code and the improved code.
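To give an idea of what non-temporal stores look like, here's a rough sketch using SSE2 intrinsics (it assumes arr is 16-byte aligned and len is a multiple of 4; fill_nt is just an illustrative name):
#include <emmintrin.h>   // SSE2 intrinsics

// Sketch: stream the value 42 into an int array, bypassing the cache.
void fill_nt(int *arr, unsigned len) {
    const __m128i v = _mm_set1_epi32(42);              // four copies of 42
    for (unsigned i = 0; i < len; i += 4)
        _mm_stream_si128((__m128i *)(arr + i), v);     // non-temporal store
    _mm_sfence();                                      // make the streamed stores visible
}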
I may come back with some actual benchmarks to prove that "it makes little difference" in a bit, but I've got an errand to run before it gets too late in the day...

Which one is more optimized for accessing array?

Solving the following exercise:
Write three different versions of a program to print the elements of ia. One version should use a range for to manage the iteration, the other two should use an ordinary for loop, in one case using subscripts and in the other using pointers. In all three programs write all the types directly. That is, do not use a type alias, auto, or decltype to simplify the code. [C++ Primer]
a question came up: which of these methods of accessing the array is the most optimized in terms of speed, and why?
My Solutions:
Foreach Loop:
int ia[3][4] = {{1,2,3,4},{5,6,7,8},{9,10,11,12}};
for (int (&i)[4] : ia)           // 1st method: using a range for loop
    for (int j : i)
        cout << j << " ";
Nested for loops:
for (int i = 0; i < 3; i++)      // 2nd method: ordinary for loop with subscripts
    for (int j = 0; j < 4; j++)
        cout << ia[i][j] << " ";
Using pointers:
int (*i)[4] = ia;
for (int t = 0; t < 3; i++, t++) {   // 3rd method: using pointers
    for (int x = 0; x < 4; x++)
        cout << (*i)[x] << " ";
}
Using auto:
for (auto &i : ia)               // 4th method: using auto, but I think it is similar to the 1st
    for (auto j : i)
        cout << j << " ";
Benchmark result using clock()
1st: 3.6 (6,4,4,3,2,3)
2nd: 3.3 (6,3,4,2,3,2)
3rd: 3.1 (4,2,4,2,3,4)
4th: 3.6 (4,2,4,5,3,4)
Simulating each method 1000 times:
1st: 2.29375 2nd: 2.17592 3rd: 2.14383 4th: 2.33333
Process returned 0 (0x0) execution time : 13.568 s
Compiler used: MinGW 3.2 with the c++11 flag enabled. IDE: Code::Blocks
I have some observations and points to make and I hope you get your answer from this.
The fourth version, as you mention yourself, is basically the same as the first version. auto can be thought of as only a coding shortcut (this is of course not strictly true, as using auto can result in getting different types than you'd expected and therefore result in different runtime behavior. But most of the time this is true.)
Your solution using pointers is probably not what people mean when they say that they are using pointers! One solution might be something like this:
for (int i = 0, *p = &(ia[0][0]); i < 3 * 4; ++i, ++p)
    cout << *p << " ";
or to use two nested loops (which is probably pointless):
for (int i = 0, *p = &(ia[0][0]); i < 3; ++i)
    for (int j = 0; j < 4; ++j, ++p)
        cout << *p << " ";
from now on, I'm assuming this is the pointer solution you've written.
In such a trivial case as this, the part that will absolutely dominate your running time is the cout. The time spent in bookkeeping and checks for the loop(s) will be completely negligible comparing to doing I/O. Therefore, it won't matter which loop technique you use.
Modern compilers are great at optimizing such ubiquitous tasks and access patterns (iterating over an array.) Therefore, chances are that all these methods will generate exactly the same code (with the possible exception of the pointer version, which I will talk about later.)
The performance of most code like this will depend more on the memory access pattern than on how exactly the compiler generates the assembly branch instructions (and the rest of the operations). This is because if a required memory block is not in the CPU cache, it's going to take a time roughly equivalent to several hundred CPU cycles (this is just a ballpark number) to fetch those bytes from RAM. Since all the examples access memory in exactly the same order, their behavior with respect to memory and cache will be the same and they will have roughly the same running time.
As a side note, the way these examples access memory is the best way for it to be accessed: linear, consecutive and from start to finish. Again, there is the problem of the cout in there, which can be a very complicated operation and can even call into the OS on every invocation, which might result in, among other things, an almost complete eviction of everything useful from the CPU cache.
On 32-bit systems and programs, the size of an int and a pointer are usually equal (both are 32 bits!) Which means that it doesn't matter much whether you pass around and use index values or pointers into arrays. On 64-bit systems however, a pointer is 64 bits but an int will still usually be 32 bits. This suggests that it is usually better to use indexes into arrays instead of pointers (or even iterators) on 64-bit systems and programs.
In this particular example, this is not significant at all though.
Your code is very specific and simple, but in the general case it is almost always better to give the compiler as much information about your code as possible. This means that you should use the narrowest, most specific device available to do a job. It in turn means that a generic for loop (i.e. for (int i = 0; i < n; ++i)) is worse for the compiler than a range-based for loop (i.e. for (auto i : v)), because in the latter case the compiler simply knows that you are going to iterate over the whole range and not go outside of it or break out of the loop, while in the generic for loop case, especially if your code is more complex, the compiler cannot be sure of this and has to insert extra checks and tests to make sure the code executes as the C++ standard says it should.
In many (most?) cases, although you might think performance matters, it does not. And most of the time you rewrite something to gain performance, you don't gain much. And most of the time the performance gain you get is not worth the loss in readability and maintainability that you sustain. So, design your code and data structures right (and keep performance in mind) but avoid this kind of "micro-optimization" because it's almost always not worth it and even harms the quality of the code too.
Generally, performance in terms of speed is very hard to reason about. Ideally you have to measure the time with real data on real hardware in real working conditions using sound scientific measuring and statistical methods. Even measuring the time it takes a piece of code to run is not at all trivial. Measuring performance is hard, and reasoning about it is harder, but these days it is the only way of recognizing bottlenecks and optimizing the code.
I hope I have answered your question.
EDIT: I wrote a very simple benchmark for what you are trying to do. The code is here. It's written for Windows and should be compilable on Visual Studio 2012 (because of the range-based for loops.) And here are the timing results:
Simple iteration (nested loops): min:0.002140, avg:0.002160, max:0.002739
Simple iteration (one loop): min:0.002140, avg:0.002160, max:0.002625
Pointer iteration (one loop): min:0.002140, avg:0.002160, max:0.003149
Range-based for (nested loops): min:0.002140, avg:0.002159, max:0.002862
Range(const ref)(nested loops): min:0.002140, avg:0.002155, max:0.002906
The relevant numbers are the "min" times (over 2000 runs of each test, for 1000x1000 arrays.) As you see, there is absolutely no difference between the tests. Note that you should turn on compiler optimizations or test 2 will be a disaster and cases 4 and 5 will be a little worse than 1 and 3.
And here is the code for the tests:
// 1. Simple iteration (nested loops)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows; ++i)
    for (unsigned j = 0; j < gc_Cols; ++j)
        sum += g_Data[i][j];
// 2. Simple iteration (one loop)
unsigned sum = 0;
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
    sum += g_Data[i / gc_Cols][i % gc_Cols];
// 3. Pointer iteration (one loop)
unsigned sum = 0;
unsigned * p = &(g_Data[0][0]);
for (unsigned i = 0; i < gc_Rows * gc_Cols; ++i)
    sum += *p++;
// 4. Range-based for (nested loops)
unsigned sum = 0;
for (auto & i : g_Data)
    for (auto j : i)
        sum += j;
// 5. Range(const ref)(nested loops)
unsigned sum = 0;
for (auto const & i : g_Data)
    for (auto const & j : i)
        sum += j;
There are many factors affecting this:
It depends on the compiler
It depends on the compiler flags used
It depends on the computer used
There is only one way to know the exact answer: measuring the time spent when dealing with huge arrays (maybe filled from a random number generator), which is the same method you have already used, except that the array size should be at least 1000x1000.
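A minimal sketch of such a measurement, using std::chrono and a 1000x1000 array of random ints (summing instead of printing so the I/O doesn't dominate):
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

int main() {
    const int rows = 1000, cols = 1000;
    std::vector<std::vector<int>> ia(rows, std::vector<int>(cols));
    for (auto &row : ia)
        for (auto &x : row)
            x = std::rand();                    // fill with random data

    auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (int i = 0; i < rows; i++)              // the loop being measured
        for (int j = 0; j < cols; j++)
            sum += ia[i][j];
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> ms = stop - start;
    std::cout << sum << " took " << ms.count() << " ms\n";
}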

How to use std::accumulate to neatly sum values in a vector pointed by separately defined indices (replacing loops)

I was wondering if there's a neater (or better yet, more efficient) method of summing values of a vector/(asymmetric) matrix (a matrix having structure, like symmetry, could of course be exploited in the looping, but that's not that pertinent to my question) pointed to by a collection of indices. Basically this code could be used to calculate, say, the cost of a route through a 2D matrix. I'm looking for a way to utilize the CPU, not the GPU.
Here's some relevant code; the one I'm more interested in is the first case. I was thinking it's possible to use std::accumulate with a lambda to capture the indices vector, but then I got to wondering if there's already a neater way, perhaps with some other operator. Not a "real problem", as looping is quite clear for my tastes too, but I'm on the hunt for the super-neat or more efficient one-liner...
template<typename out_type>
out_type sum(std::vector<float> const& matrix, std::vector<int> const& indices)
{
    out_type cost = 0;
    for(decltype(indices.size()) i = 0; i < indices.size() - 1; ++i)
    {
        const int index = indices.size() * indices[i] + indices[i + 1];
        cost += matrix[index];
    }
    const int index = indices.size() * indices[indices.size() - 1] + indices[0];
    cost += matrix[index];
    return cost;
}
template<typename out_type>
out_type sum(std::vector<std::vector<float>> const& matrix, std::vector<int> const& indices)
{
    out_type cost = 0;
    for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
    {
        cost += matrix[indices[i]][indices[i + 1]];
    }
    cost += matrix[indices[indices.size() - 1]][indices[0]];
    return cost;
}
Oh, and PPL/TBB are fair game too.
Edit
As an afterthought and as commented to John, would there be a place to employ std::common_type in the calculation as the input and output types may differ? This is a bit of hand-waving and more like learning techniques and libraries. A form of code kata, if you will.
Edit 2
Now, there's one option to make the loops faster, explained in the blog post How to process a STL vector using SSE code by theowl84. The code uses __m128 directly, but I wonder if there's something in the DirectXMath library too.
Edit 3
Now, after writing some concrete code, I found std::accumulate wouldn't get me far. Or at least I couldn't find a way to do the indices[i + 1] part in matrix[indices[i]][indices[i + 1]] in a neat way, as std::accumulate itself gives access to only the current value and the sum. In that light, it looks like novelocrat's approach would be the most fruitful one.
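For completeness, one way the lambda idea could be made to work is to let the lambda carry its own position counter, though it's arguably no neater than the plain loop (a sketch with float as the result type; sum_via_accumulate is just an illustrative name):
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch: the binary op only sees (accumulator, current element), so the
// position needed for indices[i + 1] has to be smuggled in via a counter.
float sum_via_accumulate(std::vector<std::vector<float>> const& matrix,
                         std::vector<int> const& indices)
{
    std::size_t pos = 0;
    const std::size_t n = indices.size();
    return std::accumulate(indices.begin(), indices.end(), 0.0f,
        [&](float acc, int from) {
            ++pos;                                    // index of the "to" element, wrapping to 0
            return acc + matrix[from][indices[pos % n]];
        });
}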
DeadMG proposed using parallel_reduce with associativity caveats, further commented on by novelocrat. I didn't go about seeing if I could use parallel_reduce, as the interface looked somewhat cumbersome for quick trying. Other than that, even though my code executes serially, it would suffer from the same floating point summation issues as the parallel reduction version. Though the parallel version would/could be (much) more unpredictable than the serial version, I think.
This goes somewhat tangential, but it may be of interest to some stumbling here, and those who have read this far may be (very) interested in the article Wandering Precision in The NAG blog, which details some intricacies even introduced by hardware instruction re-ordering! Then there are some ruminations about this very issue in a distributed setting in #AltDevBlogADay's Synchronous RTS Engines and a Tale of Desyncs. Also, ACCU (the general mailing list is excellent, by the way, and it's free to join) features several articles (e.g. this) on floating point accuracy. Tangential to the tangential, I found Fernando Cacciola's Robustness issues in geometric computing to be a good article to read, originally from the ACCU mailing list.
And then the std::common_type. I couldn't find a usage for that. If I had two different types as parameters, then the return value could/should be decided by std::common_type. Perhaps more pertinent is std::is_convertible with static_assert to make sure the desired result type is convertible from the argument types (with a clean error message). Other than that, I can only make up a check that the return value/intermediate calculation value accuracy is sufficient to represent the result of the summation without overflows and things like that, but I haven't come across a standard facility for that.
That's about it, I think, ladies and gentlemen. I enjoyed myself; I hope those reading this got something out of it too.
You could produce an iterator that takes matrix and indices and yields the appropriate values.
class route_iterator
{
    vector<vector<float>> const& matrix;
    vector<int> const& indices;
    int i;
public:
    route_iterator(vector<vector<float>> const& matrix_, vector<int> const& indices_,
                   int begin = 0)
        : matrix(matrix_), indices(indices_), i(begin)
    { }
    float operator*() {
        return matrix[indices[i]][indices[(i + 1) % indices.size()]];
    }
    route_iterator& operator++() {
        ++i;
        return *this;
    }
    // needed so std::accumulate can compare against the end iterator
    bool operator!=(route_iterator const& other) const { return i != other.i; }
};
Then your accumulate runs from route_iterator(matrix, indices) to route_iterator(matrix, indices, indices.size()).
Admittedly, though, this serializes the computation unless a smart compiler turns it into something parallel. What you really want are parallel map and fold (accumulate) operations.
out_type cost = 0;
for(decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
{
    cost += matrix[indices[i]][indices[i + 1]];
}
This is basically std::accumulate. PPL provides (and so does TBB, if I recall) parallel_reduce. This requires associativity but not commutativity, and + over the real/float/integer types is associative.
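As a rough illustration of the shape this takes (a sketch using TBB's functional-form parallel_reduce; route_cost is just an illustrative name, it assumes indices is non-empty, and the floating point reordering caveat discussed above still applies):
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch: each chunk sums its own slice of the route, and the partial sums
// are then combined with +.
float route_cost(std::vector<std::vector<float>> const& matrix,
                 std::vector<int> const& indices)
{
    const std::size_t n = indices.size();
    float cost = tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, n - 1), 0.0f,
        [&](tbb::blocked_range<std::size_t> const& r, float acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                acc += matrix[indices[i]][indices[i + 1]];
            return acc;
        },
        std::plus<float>());
    return cost + matrix[indices[n - 1]][indices[0]];   // the closing leg of the route
}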