I have a function that I eventually want to parallelize.
Currently, I call things in a for loop.
double temp = 0;
int y = 123; // is a value set by other code
for (vector<double>::iterator i = data.begin(); i != data.end(); i++) {
    temp += doStuff(i, y);
}
doStuff needs to know how far down the list it is, so I use i - data.begin() to calculate that.
Next, I'd like to use std::for_each instead. My challenge is that I need to pass the address of my iterator and the value of y. I've seen examples of using bind2nd to pass a parameter to the function, but how can I pass the address of the iterator as the first parameter?
The Boost FOREACH macro also looks like a possibility; however, I do not know whether it will parallelize auto-magically like the STL version does.
Thoughts, ideas, suggestions?
If you want real parallelization here, use GCC with tree vectorization optimization turned on (-O3) and SIMD (e.g. -march=native to get SSE support). If the operation (doStuff) is non-trivial, you could opt to do it ahead of time (std::transform or std::for_each) and accumulate afterwards (std::accumulate), since the accumulation will be optimized like nothing else with SSE instructions!
void apply_function(double& value)
{
    value *= 3; // just a sample...
}

// ...
std::vector<double> data(1000);
std::for_each(data.begin(), data.end(), &apply_function);
double sum = std::accumulate(data.begin(), data.end(), 0.0); // note 0.0, not 0, so the sum is accumulated as double
Note that although this will not actually run on multiple threads, the performance increase will be massive, since SSE4 instructions can handle many floating-point operations in parallel _on a single core_.
If you wanted true parallelism, use one of the following
GNU Parallel Mode
Compile with g++ -fopenmp -D_GLIBCXX_PARALLEL:
__gnu_parallel::accumulate(data.begin(), data.end(), 0.0);
OpenMP directly
Compile with g++ -fopenmp
double sum = 0.0;
#pragma omp parallel for reduction (+:sum)
for (size_t i = 0; i < data.size(); i++)
{
    sum += do_stuff(i, data[i]);
}
This will result in the loop being parallelized into as many threads (OMP team) as there are (logical) CPU cores on the actual machine, and the result 'magically' combined and synchronized.
Final remarks:
You can simulate a binary function for for_each by using a stateful function object. This is not exactly recommended practice, and it can also appear to be very inefficient (when compiling without optimization, it is). This is because function objects are passed by value throughout the STL. However, it is reasonable to expect the compiler to optimize that potential overhead away completely, especially for simple cases like the following:
struct myfunctor
{
    size_t index;
    myfunctor() : index(0) {}

    double operator()(const double& v)
    {
        ++index;      // the per-call state that makes this act like a binary call
        return v * 3; // again, just a sample
    }
};

// ...
std::for_each(data.begin(), data.end(), myfunctor());
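To actually get at the accumulated state (and at y), remember that std::for_each returns a copy of the function object, so the result can be read off the return value. A rough sketch, assuming a doStuff(index, value, y) overload; that three-argument signature is my assumption, not the question's:

double doStuff(size_t index, double value, int y); // assumed to exist with this signature

struct accumulate_stuff
{
    int y;
    size_t index = 0;
    double sum = 0.0;

    explicit accumulate_stuff(int y_) : y(y_) {}

    void operator()(const double& v)
    {
        sum += doStuff(index++, v, y); // hypothetical doStuff(index, value, y)
    }
};

// ...
double temp = std::for_each(data.begin(), data.end(), accumulate_stuff(y)).sum;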
temp += doStuff(i, y); cannot be auto-parallelized: the += operator doesn't play well with concurrency.
Further, the STL algorithms don't parallelize anything. Both Visual Studio and GCC ship parallel algorithms similar to std::for_each; if that is what you're after, you'll have to use those.
OpenMP can parallelize for loops, but you have to use pragmas to tell the compiler when and how (it can't figure it out for you).
You may have confused parallelization with loop unrolling, which is a common optimization in std::for_each implementations.
This is fairly straightforward if you can change doStuff so that it takes the value of the current element separately from the index at which the current element is located. Consider:
struct context {
    std::size_t _index;
    int         _y;
    double      _result;
};

context do_stuff_wrapper(context current, double value)
{
    current._result += doStuff(current._index, value, current._y);
    current._index++;
    return current;
}
context c = { 0, 123, 0.0 };
context result = std::accumulate(data.begin(), data.end(), c, do_stuff_wrapper);
Note, however, that the Standard Library algorithms cannot "auto-parallelize" because the functions they call may have side effects (the compiler knows whether side effects are produced, but the library functions don't). If you want a parallelized loop, you'll have to go with a special-purpose parallelizing algorithms library, like PPL or TBB.
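For example, a rough sketch with TBB's parallel_reduce, again assuming a hypothetical doStuff(index, value, y) signature rather than the question's exact one:

#include <cstddef>
#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

double doStuff(std::size_t index, double value, int y); // hypothetical signature

// Each chunk of indices is summed locally, then the partial sums are combined.
double parallel_sum(const std::vector<double>& data, int y)
{
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, data.size()),
        0.0,
        [&](const tbb::blocked_range<std::size_t>& r, double acc) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                acc += doStuff(i, data[i], y);
            return acc;
        },
        std::plus<double>());
}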
Related
I am trying to implement a parallel foreach loop for std::vector which runs the computations in an optimal number of threads (the number of cores, minus one for the main thread); however, my implementation seems to be not fast enough: it actually runs 6 times slower than the single-threaded one!
The thread instantiation is often blamed for being a bottleneck, so I tried a larger vector; however, that did not seem to help.
I am currently stuck watching the parallel algorithm execute in 13000-20000 microseconds in a separate thread while the single-threaded one executes in 120-200 microseconds in the main thread, and I cannot figure out what I am doing wrong. Of those 13-20 ms of the parallel run, 8 or 9 ms are usually spent creating the threads, yet I still see no reason why a std::for_each running through 1/3 of the vector in a separate thread should take several times longer than another std::for_each needs to iterate through the whole vector.
#include <iostream>
#include <vector>
#include <thread>
#include <algorithm>
#include <chrono>
const unsigned int numCores = std::thread::hardware_concurrency();
const size_t numUse = numCores - 1;
struct foreach
{
    inline static void go(std::function<void(uint32_t&)>&& func, std::vector<uint32_t>& cont)
    {
        std::vector<std::thread> vec;
        vec.reserve(numUse);
        std::vector<std::vector<uint32_t>::iterator> arr(numUse + 1);
        size_t distance = cont.size() / numUse;
        for (size_t i = 0; i < numUse; i++)
            arr[i] = cont.begin() + i * distance;
        arr[numUse] = cont.end();
        for (size_t i = 0; i < numUse - 1; i++)
        {
            vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
        }
        vec.emplace_back([&] { std::for_each(cont.begin() + (numUse - 1) * distance, cont.end(), func); });
        for (auto &d : vec)
        {
            d.join();
        }
    }
};
int main()
{
    std::chrono::steady_clock clock;
    std::vector<uint32_t> numbers;
    for (size_t i = 0; i < 50000000; i++)
        numbers.push_back(i);

    std::chrono::steady_clock::time_point t0m = clock.now();
    std::for_each(numbers.begin(), numbers.end(), [](uint32_t& value) { ++value; });
    std::chrono::steady_clock::time_point t1m = clock.now();
    std::cout << "Single-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1m - t0m).count() << "mcs\n";

    std::chrono::steady_clock::time_point t0s = clock.now();
    foreach::go([](uint32_t& i) { ++i; }, numbers);
    std::chrono::steady_clock::time_point t1s = clock.now();
    std::cout << "Multi-threaded run executes in " << std::chrono::duration_cast<std::chrono::microseconds>(t1s - t0s).count() << "mcs\n";

    getchar();
}
Is there a way I can optimize this and increase the performance?
The compiler I am using is Visual Studio 2017's. The configuration is Release x86. I have also been advised to use a profiler and am currently figuring out how to use one.
I actually managed to get the parallel code to run faster than the regular one, but that required a vector of dozens of thousands of vectors of five elements each. If anyone has advice on how to improve the performance, or on where I can find a better implementation to study, it would be appreciated.
Thank you for providing some example code.
Getting good metrics (especially on parallel code) can be pretty tricky. Your metrics are tainted.
Use high_resolution_clock instead of steady_clock for profiling.
Don't include the thread startup time in your timing measurement. Thread launch/join is orders of magnitude longer than your actual work here. You should create the threads once and use condition variables to make them sleep until you signal them to work. This is not trivial, but it is essential that you don't measure the thread startup time.
Visual Studio has a profiler. You need to compile your code with release optimizations but also include the debug symbols (those are excluded in the default release configuration). I haven't looked into how to set this up manually because I usually use CMake and it sets up a RelWithDebInfo configuration automatically.
Another issue kind of related to having good metrics is that your "work" is just incrementing an integer. Is that really representative of the work your program is going to be doing? Increment is really fast. If you look at the assembly generated by your sequential version, everything gets inlined into a really short loop.
Lambdas have a very good chance of being inlined. But in your go function, you're casting the lambda to std::function. std::function has a very poor chance of being inlined.
So if you want to keep the chance of getting the lambda inlined, you have to do some template tricks:
template <typename FUNC>
inline static void go(FUNC&& func, std::vector<uint32_t>& cont)
By manually inlining your code (I moved the contents of the go function to main) and doing step 2 above, I was able to get the parallel version (4 threads on a hyperthreaded dual-core) to run in about 75% of the time. That's not particularly good scaling, but it's not bad considering that the original was already pretty fast. For a further optimization, I would use SIMD aka "vector" (different from std::vector except in the sense that they both relate to arrays) operations which will apply the increment to multiple array elements in one iteration.
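For instance, a minimal sketch of that last idea using SSE2 intrinsics, incrementing four 32-bit elements per iteration (the function name is mine):

#include <cstdint>
#include <emmintrin.h> // SSE2
#include <vector>

void increment_sse(std::vector<uint32_t>& v)
{
    const __m128i ones = _mm_set1_epi32(1);
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4)
    {
        // load four elements, add 1 to each lane, store them back
        __m128i x = _mm_loadu_si128(reinterpret_cast<__m128i*>(&v[i]));
        x = _mm_add_epi32(x, ones);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&v[i]), x);
    }
    for (; i < v.size(); ++i) // leftover tail elements
        ++v[i];
}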
You have a race condition here:
for (size_t i = 0; i < numUse - 1; i++)
{
    vec.emplace_back([&] { std::for_each(cont.begin() + i * distance, cont.begin() + (i + 1) * distance, func); });
}
Because you set the default lambda capture to capture-by-reference, the i variable is captured by reference, and that can cause some threads to check the wrong range or too long a range. You could do this: [&, i], but why risk shooting yourself in the foot again? Scott Meyers recommends against using default capture modes. Just do [&cont, &distance, &func, i], as in the corrected snippet below.
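Something like this, unchanged apart from the capture list:

for (size_t i = 0; i < numUse - 1; i++)
{
    // i is captured by value, so each thread gets its own copy of the index
    vec.emplace_back([&cont, &distance, &func, i] {
        std::for_each(cont.begin() + i * distance,
                      cont.begin() + (i + 1) * distance, func);
    });
}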
UPDATE:
I think it's a fine idea to move your foreach to its own space. I think what you should do is separate the thread creation from task dispatch. That means you need some kind of signaling system (generally condition variables). You could look into thread pools.
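For illustration only, a minimal sketch of the condition-variable signalling idea; the class name and structure are my own, and notifying the caller when a job finishes is omitted to keep it short:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// A single persistent worker: created once, it sleeps on a condition variable
// until a job is posted, so the thread startup cost is paid only once.
class worker
{
public:
    worker() : thread_([this] { run(); }) {}

    ~worker()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    void post(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            job_ = std::move(job);
        }
        cv_.notify_one();
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return quit_ || static_cast<bool>(job_); });
                if (quit_)
                    return;
                job = std::move(job_);
                job_ = nullptr;
            }
            job(); // invoked once per dispatch, not once per element
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::function<void()> job_;
    bool quit_ = false;
    std::thread thread_; // declared last so the other members exist before run() starts
};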
An easy way to add threadpools is to use OpenMP, which Visual Studio 2017 has support for (OpenMP 2.0). A caveat is that there's no guarantee that the threads won't be created/destroyed during entry/exit of the parallel section (it's implementation dependent). So it trades off performance with ease of use.
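For example, something like this is all it takes for the increment benchmark above (compile with /openmp in Visual Studio or -fopenmp with GCC/clang; the signed loop index is a requirement of OpenMP 2.0):

#include <cstdint>
#include <vector>

void increment_all(std::vector<uint32_t>& numbers)
{
    // OpenMP splits the iterations across a team of threads that the runtime
    // typically keeps alive between parallel regions.
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(numbers.size()); ++i)
        ++numbers[i];
}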
If you can use C++17, it has a standard parallel for_each (the ExecutionPolicy overload). Most of the standard algorithms do. https://en.cppreference.com/w/cpp/algorithm/for_each
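For reference, the ExecutionPolicy overload looks like this (it needs /std:c++17 and a recent toolset in MSVC; with GCC's libstdc++ the parallel policies are typically backed by TBB):

#include <algorithm>
#include <cstdint>
#include <execution>
#include <vector>

void increment_parallel(std::vector<uint32_t>& numbers)
{
    // The library decides how to split the range across threads.
    std::for_each(std::execution::par, numbers.begin(), numbers.end(),
                  [](uint32_t& value) { ++value; });
}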
As for using std::function: you can use it, you just don't want your basic operation (the one that will be called 50,000,000 times) to be a std::function.
Bad:
void go(std::function<void(int&)> func)
{
    // every one of the 50,000,000 element visits goes through the std::function:
    std::thread t([&] { std::for_each(v.begin(), v.end(), func); });
    ...
}
...
go([](int& i) { ++i; });
Good:
void go(std::function<void()> func)
{
    // the std::function is invoked only once per thread:
    std::thread t(func);
    ...
}
...
go([&v]() { std::for_each(v.begin(), v.end(), [](int& i) { ++i; }); });
In the good version, the short inner lambda (i.e. ++i) gets inlined in the call to for_each. That's important because it gets called 50 million times. The call to the bigger lambda is not inlined (because it's converted to std::function) but that's ok because it only gets called once per thread.
I'm trying to replicate the effects of false sharing using OpenMP as explained in the OpenMP introduction by Tim Mattson.
My program performs a straightforward numerical integration (see the link for the mathematical details) and I've implemented two versions. The first is supposed to be cache-friendly: each thread keeps a local variable in which it accumulates its portion of the index space,
const auto num_slices = 100000000;
const auto num_threads = 4; // Swept from 1 to 9 threads
const auto slice_thickness = 1.0 / num_slices;
const auto slices_per_thread = num_slices / num_threads;

std::vector<double> partial_sums(num_threads);

#pragma omp parallel num_threads(num_threads)
{
    double local_buffer = 0;
    const auto thread_num = omp_get_thread_num();
    for (auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
        local_buffer += func(slice * slice_thickness); // <-- Updates thread-exclusive buffer
    partial_sums[thread_num] = local_buffer;
}
// Sum up partial_sums to receive final result
// ...
while the second version has each thread update an element in a shared std::vector<double>, causing each write to invalidate the cache lines on all other threads
// ... as above
#pragma omp parallel num_threads(num_threads)
{
    const auto thread_num = omp_get_thread_num();
    for (auto slice = slices_per_thread * thread_num; slice < slices_per_thread * (thread_num + 1); ++slice)
        partial_sums[thread_num] += func(slice * slice_thickness); // <-- Invalidates caches
}
// Sum up partial_sums to receive final result
// ...
The problem is that I am unable to see any effects of false sharing whatsoever unless I turn off optimization.
Compiling my code (which has to account for a few more details than the snippets above) with GCC 8.1 and without optimization (-O0) yields the results I naively expected, while full optimization (-O3) eliminates any performance difference between the two versions, as shown in the plot.
What's the explanation for this? Does the compiler actually eliminate false sharing? If not, how come the effect is so small when running the optimized code?
I'm on a Core-i7 machine using Fedora. The plot displays mean values whose sample standard deviations don't add any information to this question.
tl;dr: The compiler optimizes your second version into the first.
Consider the code within the loop of your second implementation - ignoring the OMP/multithreaded aspect of it for a moment.
You have increments of a value within an std::vector - which is necessarily located on the heap (well, up until and including in C++17 anyway). The compiler sees you're adding to a value on the heap within a loop; this is a typical candidate for optimization: It takes the heap access out of the loop, and uses a register as a buffer. It doesn't even need to read from the heap, since they're just additions - so it essentially arrives at your first solution.
See this happening on GodBolt (with a simplified example) - notice how the code for bar1() and bar2() is almost the same, with accumulation happening in registers.
Now, the fact that there's multi-threading and OMP involved doesn't change the above. If you were to use, say, std::atomic<double> instead of double, then it might have changed (and maybe not even then, if the compiler is smart enough).
Notes:
Thanks go to @Evg for noticing a glaring mistake in the code of a previous version of this answer.
The compiler must be able to know that func() won't also change the value of your vector, or to decide that, for the purposes of the addition, it shouldn't really matter.
This optimization could be seen as a strength reduction, from an operation on the heap to one in a register, but I'm not sure that term is in use for this case.
I read a question asking which implementation is preferable for counting certain vector items.
Is this better than
auto countif = [] (T t) { return t.countable(); };
const int count = std::count_if(v.begin(), v.end(), countif);
return count;
this
int count = 0;
for (auto& t : v)
    if (t.countable()) count++;
The question was voted down and has since been deleted.
You should almost always use an algorithm like std::count_if if one is available.
The reason is that the library vendor can apply optimizations that would not be portable if you wrote them manually into your own loop. For example, there are CPU-specific intrinsic functions that can speed up even basic tasks like counting values in an array.
Unless you have a specific need for non-portable optimizations, the algorithms provided by the compiler's standard library are likely to be faster, in a portable way, than something you would write yourself.
In order to increase the performance of our applications, we have to consider loop optimisation techniques during the development phase.
I'd like to show you some different ways to iterate over a simple std::vector<uint32_t> v:
Unoptimized loop with index:
uint64_t sum = 0;
for (unsigned int i = 0; i < v.size(); i++)
    sum += v[i];
Unoptimized loop with iterator:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it;
for (it = v.begin(); it != v.end(); it++)
    sum += *it;
Cached std::vector::end iterator:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it, end(v.end());
for (it = v.begin(); it != end; it++)
    sum += *it;
Pre-increment iterators:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator it, end(v.end());
for (it = v.begin(); it != end; ++it)
    sum += *it;
Range-based loop:
uint64_t sum = 0;
for (auto const &x : v)
    sum += x;
There are also other means to build a loop in C++; for instance by using std::for_each, BOOST_FOREACH, etc...
In your opinion, which is the best approach to increase the performance and why?
Furthermore, in performance-critical applications it could be useful to unroll the loops: again, which approach would you suggest?
There's no hard and fast rule, since it depends on the implementation. If the measurements I did some years back are typical, however, about the only thing which makes a difference is caching the end iterator. Pre- or post-increment makes no difference, regardless of the container and iterator type.

At the time, I didn't measure indexing (because I was comparing iterators of different types of container as well, and not all support indexing). But I would guess that if you use indexes, you should cache the result of v.size() as well.

Of course, these measurements were for one compiler (g++) on one system, with specific hardware. The only way you can know for your environment is to measure yourself.

RE your note: are you sure you have full optimization turned on? My measurements showed no difference between 3 and 4, and I doubt that compilers optimize less today.

It's very important for the optimizations here that the functions are actually inlined. If they're not, post-incrementation does require some extra copying, and typically an extra function call (to the copy constructor of the iterator) as well. Once the functions are inlined, however, the compiler can easily see that all of this is inessential, and (at least when I tried it) generate exactly the same code in both cases. (I'd use pre-incrementation anyway. Not because it makes a difference, but because if you don't, some idiots will come along claiming it will, despite your measurements. Or maybe they're not idiots, but are just using a particularly stupid compiler.)

To tell the truth, when I did the measurements, I was surprised that caching the end iterator made a difference even for vector, whereas there was no difference between pre- and post-incrementation, even for a reverse iterator into a map. After all, end() was inlined as well; in fact, every single function used in my tests was inlined.
As to unrolling the loops: I'd probably do something like this:
uint64_t sum = 0;
std::vector<uint32_t>::const_iterator current = v.begin();
std::vector<uint32_t>::const_iterator end = v.end();
switch ((end - current) % 4) {
case 3:
    sum += *current++;
    // fall through
case 2:
    sum += *current++;
    // fall through
case 1:
    sum += *current++;
    // fall through
case 0:
    break;
}
while (current != end) {
    sum += current[0] + current[1] + current[2] + current[3];
    current += 4;
}
(This is a factor of 4. You can easily increase it if necessary.)
I'm going on the assumption that you are well aware of the evils of premature micro-optimization, and that you have identified hotspots in your code by profiling and all the rest. I'm also going on the assumption that you're only concerned about performance with respect to speed. That is, you don't care deeply about the size of the resulting code or memory use.
The code snippets you have provided will yield largely the same results, with the exception of the cached end() iterator. Aside from caching and inlining as much as you can, there is not much you can do to tweak the structure of the loops above to realize significant gains in performance.
Writing performant code in critical paths relies first and foremost on selecting the best algorithm for the job. If you have a performance problem, look first and hard at the algorithm. The compiler will generally do a much better job at micro-optimizing the code you wrote than you could ever hope to.
All that being said, there are a few things you can do to give your compiler a little help.
Cache everything you can
Keep small allocations to a minimum, especially within a loop
Make as many things const as you can. This gives the compiler additional opportunities to micro-optimize.
Learn your toolchain well and leverage that knowledge
Learn your architecture well and leverage that knowledge
Learn to read assembly code and examine the assembly output from your compiler
Learning your toolchain and architecture are going to yield the most benefits. For example, GCC has many options you can enable to increase performance, including loop unrolling. See here. When iterating datasets, it is often beneficial to keep each item aligned to the size of a cache line. In modern architecture this often means 64 bytes, but learn your architecture.
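For instance, one way to ask the compiler for cache-line alignment of a data item (assuming the common 64-byte line size mentioned above):

#include <cstdint>

// Asks the compiler to place each element on its own 64-byte boundary
// (64 bytes is common on current x86, but check your architecture).
struct alignas(64) item
{
    std::uint64_t key;
    double value;
};

static_assert(sizeof(item) == 64, "item is padded to one cache line");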
Here is an excellent guide to writing performant C++ in an Intel environment.
Once you have learned your architecture and toolchain, you might find that the algorithm you originally selected is not optimal in your real world. Be open to change in the face of new data.
It's very likely that modern compilers will produce the same assembly for the approaches you give above. You should look at the actual assembly (after enabling optimizations) to see.
When you're down to worrying about the speed of your loops, you should really think about whether your algorithm is truly optimal. If you're convinced it is, then you need to think about (and make use of) the underlying implementation of the data structures. std::vector uses an array underneath, and, depending on the compiler and the other code in the function, pointer aliasing may prevent the compiler from fully optimizing your code.
There's a fair amount of information out there on pointer aliasing (including What is the strict aliasing rule?), but Mike Acton has some wonderful information about pointer aliasing.
The restrict keyword (see What does the restrict keyword mean in C++? or, again, Mike Acton), available through compiler extensions for many years and codified in C99 (currently only available as a compiler extension in C++), is meant to deal with this. The way to use this in your code is far more C-like, but may allow the compiler to better optimize your loop, at least for the examples you've given:
uint64_t sum = 0;
uint32_t *restrict velt = &v[0];
uint32_t *restrict vend = velt + v.size();
while (velt < vend) {
    sum += *velt;
    velt++;
}
However, to see whether this provides a difference, you really need to profile different approaches for your actual, real-life problem, and possibly look at the underlying assembly produced. If you're summing simple data types, this may help you. If you're doing anything more complicated, including calling a function that cannot be inlined in the loop, it's unlikely to make any difference at all.
If you're using clang, then pass it these flags:
-Rpass-missed=loop-vectorize
-Rpass-analysis=loop-vectorize
In Visual C++ add this to the build:
/Qvec-report:2
These flags will tell you if a loop fails to vectorise (and give you an often cryptic message explaining why).
In general though, prefer options 4 and 5 (or std::for_each). Whilst clang and gcc will typically do a decent job in most cases, Visual C++ tends to err on the side of caution sadly. If the scope of the variable is unknown (e.g. a reference or pointer passed into a function, or this pointer), then vectorisation often fails (containers in the local scope will almost always vectorise).
#include <vector>
#include <cmath>

// fails to vectorise in Visual C++ and /O2
void func1(std::vector<float>& v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        v[i] = std::sqrt(v[i]);
    }
}

// this will vectorise with sqrtps
void func2(std::vector<float>& v)
{
    for (std::vector<float>::iterator it = v.begin(), e = v.end(); it != e; ++it)
    {
        *it = std::sqrt(*it);
    }
}
Clang and gcc aren't immune to these issues either. If you always take a copy of begin/end, then it cannot be a problem.
Here's another classic that affects many compilers sadly (clang 3.5.0 fails this trivial test, but it's fixed in clang 4.0). It crops up a LOT!
struct Foo
{
    void func3();
    void func4();
    std::vector<float> v;
    float f;
};

// will not vectorise
void Foo::func3()
{
    // this->v.end() !!
    for (std::vector<float>::iterator it = v.begin(); it != v.end(); ++it)
    {
        *it *= f; // this->f !!
    }
}

void Foo::func4()
{
    // you need to take a local copy of v.end(), and 'f'.
    const float temp = f;
    for (std::vector<float>::iterator it = v.begin(), e = v.end(); it != e; ++it)
    {
        *it *= temp;
    }
}
In the end, if it's something you care about, use the vectorisation reports from the compiler to fix up your code. As mentioned above, this is basically an issue of pointer aliasing. You can use the restrict keyword to help fix some of these issues (but I've found that applying restrict to 'this' is often not that useful).
Use range based for by default as it will give the compiler the most direct information to optimize (compiler knows it can cache the end iterator for example). Then profile and only optimize further if you identify a significant bottleneck. There will be very few real world situations where these different loop variants make a meaningful performance difference. Compilers are pretty good at loop optimization and it is far more likely that you should focus your optimization effort elsewhere (like choosing a better algorithm or focusing on optimizing the loop body).
I was wondering if there's a neater (or better yet, more efficient) method of summing the values of a vector/(asymmetric) matrix pointed to by a collection of indices (a matrix having structure, like symmetry, could of course be exploited when looping, but that's not really pertinent to my question). Basically this code could be used to calculate, say, the cost of a route through a 2D matrix. I'm looking for a way to utilize the CPU, not the GPU.
Here's some relevant code; the case I'm more interested in is the first one. I was thinking it's possible to use std::accumulate with a lambda to capture the indices vector, but then I got to wondering whether there's already a neater way, perhaps with some other operator. Not a "real problem", as looping is quite clear for my tastes too, but I'm on the hunt for the super-neat or more efficient one-liner...
template<typename out_type>
out_type sum(std::vector<float> const& matrix, std::vector<int> const& indices)
{
    out_type cost = 0;
    for (decltype(indices.size()) i = 0; i < indices.size() - 1; ++i)
    {
        const int index = indices.size() * indices[i] + indices[i + 1];
        cost += matrix[index];
    }
    const int index = indices.size() * indices[indices.size() - 1] + indices[0];
    cost += matrix[index];
    return cost;
}

template<typename out_type>
out_type sum(std::vector<std::vector<float>> const& matrix, std::vector<int> const& indices)
{
    out_type cost = 0;
    for (decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
    {
        cost += matrix[indices[i]][indices[i + 1]];
    }
    cost += matrix[indices[indices.size() - 1]][indices[0]];
    return cost;
}
Oh, and PPL/TBB are fair game too.
Edit
As an afterthought and as commented to John, would there be a place to employ std::common_type in the calculation as the input and output types may differ? This is a bit of hand-waving and more like learning techniques and libraries. A form of code kata, if you will.
Edit 2
Now, there's one option to make the loops faster, explained in the blog post How to process a STL vector using SSE code by theowl84. The code uses __m128 directly, but I wonder if there's something in the DirectXMath library too.
Edit 3
Now, after writing some concrete code, I found std::accumulate wouldn't get me far. Or at least I couldn't find a way to do the indices[i + 1] part of matrix[indices[i]][indices[i + 1]] in a neat way, as std::accumulate itself gives access only to the current value and the sum. In that light, it looks like Novelocrat's approach would be the most fruitful one.
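(For completeness, the pairwise access does map onto std::inner_product, which walks the index range against itself shifted by one; a rough sketch, with the wrap-around term added separately and the function name made up:)

#include <functional>
#include <numeric>
#include <vector>

double route_cost(std::vector<std::vector<float>> const& matrix,
                  std::vector<int> const& indices)
{
    // Sums matrix[indices[i]][indices[i + 1]] over consecutive pairs.
    double cost = std::inner_product(
        indices.begin(), indices.end() - 1, // indices[0 .. n-2]
        indices.begin() + 1,                // indices[1 .. n-1]
        0.0,
        std::plus<double>(),
        [&](int from, int to) { return matrix[from][to]; });
    return cost + matrix[indices.back()][indices.front()]; // close the route
}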
DeadMG proposed using parallel_reduce, with associativity caveats, further commented on by Novelocrat. I didn't go about seeing if I could use parallel_reduce, as the interface looked somewhat cumbersome for quick trying. Other than that, even though my code executes serially, it would suffer from the same floating-point summation issues as the parallel reduction version. Though the parallel version would/could be (much) more unpredictable than the serial version, I think.
This goes somewhat tangential, but it may be of interest to some stumbling here, and those who have read this far may be (very) interested in the article Wandering Precision in The NAG blog, which details some intricacies introduced even by hardware instruction re-ordering! Then there are some ruminations about this very issue in a distributed setting in the #AltDevBlogADay post Synchronous RTS Engines and a Tale of Desyncs. Also, ACCU (the general mailing list is excellent, by the way, and it's free to join) features several articles (e.g. this) on floating-point accuracy. Tangential to the tangential, I found Fernando Cacciola's Robustness issues in geometric computing to be a good read, originally from the ACCU mailing list.
And then the std::common_type. I couldn't find a use for it. If I had two different types as parameters, then the return type could/should be decided by std::common_type. Perhaps more pertinent is std::is_convertible with static_assert to make sure the desired result type is convertible from the argument types (with a clean error message). Other than that, I can only imagine a check that the return value/intermediate calculation accuracy is sufficient to represent the result of the summation without overflows and the like, but I haven't come across a standard facility for that.
That's about it, I think, ladies and gentlemen. I enjoyed myself, and I hope those reading this got something out of it too.
You could produce an iterator that takes matrix and indices and yields the appropriate values.
class route_iterator
{
    vector<vector<float>> const& matrix;
    vector<int> const& indices;
    int i;
public:
    route_iterator(vector<vector<float>> const& matrix_, vector<int> const& indices_,
                   int begin = 0)
        : matrix(matrix_), indices(indices_), i(begin)
    { }

    float operator*() const {
        return matrix[indices[i]][indices[(i + 1) % indices.size()]];
    }

    route_iterator& operator++() {
        ++i;
        return *this;
    }

    // needed so std::accumulate can compare against the end position
    bool operator!=(route_iterator const& other) const {
        return i != other.i;
    }
};
Then your accumulate runs from route_iterator(matrix, indices) to route_iterator(matrix, indices, indices.size()).
Admittedly, though, this sequentializes without a smart compiler turning it into something parallel. What you really want are parallel map and fold (accumulate) operations.
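(These days, C++17's std::transform_reduce expresses exactly that map-plus-fold shape; a rough sketch over the positions, with a made-up function name, and note that a parallel reduction may reorder the floating-point additions:)

#include <cstddef>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

double parallel_route_cost(std::vector<std::vector<float>> const& matrix,
                           std::vector<int> const& indices)
{
    // Map: position -> matrix element for the pair (indices[i], indices[i+1]).
    // Fold: add the mapped values, in an unspecified order.
    std::vector<std::size_t> positions(indices.size());
    std::iota(positions.begin(), positions.end(), std::size_t{0});

    return std::transform_reduce(
        std::execution::par, positions.begin(), positions.end(), 0.0,
        std::plus<double>(),
        [&](std::size_t i) {
            return matrix[indices[i]][indices[(i + 1) % indices.size()]];
        });
}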
out_type cost = 0;
for (decltype(indices.size()) i = 0; i < indices.size() - 1; i++)
{
    cost += matrix[indices[i]][indices[i + 1]];
}
This is basically std::accumulate. PPL provides (and so does TBB, if I recall) parallel_reduce. This requires associativity but not commutativity, and + over reals and integers is associative (for floating point it only holds approximately, which is the source of the precision caveats mentioned in the question).