C++: why is std::async slower than sequential execution?

#include <future>
#include <iostream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <random>
#include <chrono>
#include <utility>
#include <type_traits>
template <class Clock = std::chrono::high_resolution_clock, class Task>
double timing(Task&& t, typename std::result_of<Task()>::type* r = nullptr)
{
using namespace std::chrono;
auto begin = Clock::now();
if (r != nullptr) *r = std::forward<Task>(t)();
auto end = Clock::now();
return duration_cast<duration<double>>(end - begin).count();
}
template <typename Num>
double sum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
double s = 0;
for (auto i = l; i <= h; i++) s += v[i];
return s;
}
template <typename Num>
double asum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
auto m = (l + h) / 2;
auto s1 = std::async(std::launch::async, sum<Num>, v, l, m);
auto s2 = std::async(std::launch::async, sum<Num>, v, m+1, h);
return s1.get() + s2.get();
}
int main()
{
std::vector<uint> v(1000);
auto s = std::chrono::system_clock::now().time_since_epoch().count();
std::generate(v.begin(), v.end(), std::minstd_rand0(s));
double r;
std::cout << 1000 * timing([&]() -> double { return asum(v, 0, v.size() - 1); }, &r) << " msec | rst " << r << std::endl;
std::cout << 1000 * timing([&]() -> double { return sum(v, 0, v.size() - 1); }, &r) << " msec | rst " << r << std::endl;
}
Hi,
So above are two functions for summing up a vector of random numbers.
I did several runs, but it seems that I did not benefit from std::async. Below are some results I got.
0.130582 msec | rst 1.09015e+12
0.001402 msec | rst 1.09015e+12
0.23185 msec | rst 1.07046e+12
0.002308 msec | rst 1.07046e+12
0.18052 msec | rst 1.07449e+12
0.00244 msec | rst 1.07449e+12
0.190455 msec | rst 1.08319e+12
0.002315 msec | rst 1.08319e+12
In all four cases the async version took more time. But ideally it should have been about two times faster, right?
Did I miss anything in my code?
By the way I am running on OS X 10.10.4 Macbook Air with 1.4 GHz Intel Core i5.
Thanks,
Edits:
compiler flags: g++ -o asum asum.cpp -std=c++11
I changed the flags to include -O3 and the vector size to 10000000, but the results are still weird.
72.1743 msec | rst 1.07349e+16
14.3739 msec | rst 1.07349e+16
58.3542 msec | rst 1.07372e+16
12.1143 msec | rst 1.07372e+16
57.1576 msec | rst 1.07371e+16
11.9332 msec | rst 1.07371e+16
59.9104 msec | rst 1.07395e+16
11.9923 msec | rst 1.07395e+16
64.032 msec | rst 1.07371e+16
12.0929 msec | rst 1.07371e+16

here
auto s1 = std::async(std::launch::async, sum<Num>, v, l, m);
auto s2 = std::async(std::launch::async, sum<Num>, v, m+1, h);
async will store its own copy of the vector, twice. You should use std::cref, and make sure the futures are retrieved before the vector dies (as they are in your current code) and that accesses are properly synchronized (as they are in your current code).
As mentioned in comments, thread creation overhead may further slow down the code.
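Putting the two fixes together, a corrected asum might look like the sketch below (sum is reproduced from the question, with its accumulator initialised):

```cpp
#include <future>
#include <functional>  // std::cref
#include <vector>
#include <cstddef>

// Same sum as in the question, but with s initialised.
template <typename Num>
double sum(const std::vector<Num>& v, std::size_t l, std::size_t h)
{
    double s = 0;
    for (auto i = l; i <= h; ++i) s += v[i];
    return s;
}

// Hypothetical fixed version: std::cref passes the vector by reference,
// so neither async task copies it.
template <typename Num>
double asum_cref(const std::vector<Num>& v, std::size_t l, std::size_t h)
{
    auto m = (l + h) / 2;
    auto s1 = std::async(std::launch::async, sum<Num>, std::cref(v), l, m);
    auto s2 = std::async(std::launch::async, sum<Num>, std::cref(v), m + 1, h);
    return s1.get() + s2.get();
}
```

The original code copied the vector because std::async decay-copies its arguments by default; std::cref makes it forward a reference_wrapper instead.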

Well, this is the simplest possible example, and the results should not be taken as conclusive, for the following reasons:
When you create a thread, it takes some extra CPU cycles to set up the thread context and stack. Those cycles are added to the cost of the sum function.
When the main thread runs this code, it is otherwise idle and doing nothing other than the sum.
We go for a multithreaded solution only when we can't accomplish something in a single thread, or when we need to wait on some blocking input.
Just by creating threads, you don't increase performance. You increase performance by carefully designing multithreaded applications that use otherwise-idle CPUs while other work is waiting, for example on I/O.

First, the performance of your original async function is bad compared with the sequential one because it makes one more copy of your test data, as mentioned in other answers. Second, you might not see an improvement even after fixing the copying issue, because creating threads is not cheap and can kill your performance gain.
From the benchmark results, I can see the async version is 1.88 times faster than the sequential version for N = 1000000. However, if I use N = 10000, the async version is 3.55 times slower. Both the non-iterator and iterator solutions produce similar results.
Besides that, you should use iterators when writing your code: this approach is more flexible (for example, you can try different container types), gives similar performance to the C-style version, and is also more elegant IMHO :).
Benchmark results:
-----------------------------------------------------------------------------------------------------------------------------------------------
Group | Experiment | Prob. Space | Samples | Iterations | Baseline | us/Iteration | Iterations/sec |
-----------------------------------------------------------------------------------------------------------------------------------------------
sum | orginal | Null | 50 | 10 | 1.00000 | 1366.30000 | 731.90 |
sum | orginal_async | Null | 50 | 10 | 0.53246 | 727.50000 | 1374.57 |
sum | iter | Null | 50 | 10 | 1.00022 | 1366.60000 | 731.74 |
sum | iter_async | Null | 50 | 10 | 0.53261 | 727.70000 | 1374.19 |
Complete.
Celero
Timer resolution: 0.001000 us
-----------------------------------------------------------------------------------------------------------------------------------------------
Group | Experiment | Prob. Space | Samples | Iterations | Baseline | us/Iteration | Iterations/sec |
-----------------------------------------------------------------------------------------------------------------------------------------------
sum | orginal | Null | 50 | 10 | 1.00000 | 13.60000 | 73529.41 |
sum | orginal_async | Null | 50 | 10 | 3.55882 | 48.40000 | 20661.16 |
sum | iter | Null | 50 | 10 | 1.00000 | 13.60000 | 73529.41 |
sum | iter_async | Null | 50 | 10 | 3.53676 | 48.10000 | 20790.02 |
Complete.
Complete code sample
#include <algorithm>
#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <type_traits>
#include <vector>
#include <iostream>
#include "celero/Celero.h"
constexpr int NumberOfSamples = 50;
constexpr int NumberOfIterations = 10;
template <typename Container>
double sum(const Container &v, const std::size_t begin, const std::size_t end) {
double s = 0;
for (auto idx = begin; idx < end; ++idx) {
s += v[idx];
}
return s;
}
template <typename Container>
double sum_async(const Container &v, const std::size_t begin, const std::size_t end) {
auto middle = (begin + end) / 2;
// Removing std::cref will slow down this function because it makes two copies of v.
auto s1 = std::async(std::launch::async, sum<Container>, std::cref(v), begin, middle);
auto s2 = std::async(std::launch::async, sum<Container>, std::cref(v), middle, end);
return s1.get() + s2.get();
}
template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_iter(Iter begin, Iter end) {
typename std::iterator_traits<Iter>::value_type results = 0.0;
std::for_each(begin, end, [&results](auto const item) { results += item; });
return results;
}
template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_iter_async(Iter begin, Iter end) {
Iter middle = begin + std::distance(begin, end) / 2;
auto s1 = std::async(std::launch::async, sum_iter<Iter>, begin, middle);
auto s2 = std::async(std::launch::async, sum_iter<Iter>, middle, end);
return s1.get() + s2.get();
}
template <typename T> auto create_test_data(const size_t N) {
auto s = std::chrono::system_clock::now().time_since_epoch().count();
std::vector<T> v(N);
std::generate(v.begin(), v.end(), std::minstd_rand0(s));
return v;
}
// Create test data
constexpr size_t N = 10000;
using value_type = double;
auto data = create_test_data<value_type>(N);
using container_type = decltype(data);
CELERO_MAIN
BASELINE(sum, orginal, NumberOfSamples, NumberOfIterations) {
celero::DoNotOptimizeAway(sum<container_type>(data, 0, N));
}
BENCHMARK(sum, orginal_async, NumberOfSamples, NumberOfIterations) {
celero::DoNotOptimizeAway(sum_async<container_type>(data, 0, N));
}
BENCHMARK(sum, iter, NumberOfSamples, NumberOfIterations) {
celero::DoNotOptimizeAway(sum_iter(data.cbegin(), data.cend()));
}
BENCHMARK(sum, iter_async, NumberOfSamples, NumberOfIterations) {
celero::DoNotOptimizeAway(sum_iter_async(data.cbegin(), data.cend()));
}

Related

How to remove the Nth element in a range?

I can write:
my_range | ranges::views::remove(3)
using the ranges-v3 library, to remove the element(s) equal to 3 from the range my_range. This can also be done in C++20 with
my_range | std::views::filter([](auto const& val){ return val != 3; })
But - how can I remove the element at position 3 in my_range, keeping the elements at positions 0, 1, 2, 4, 5 etc.?
Here's one way to do it:
#include <iostream>
#include <ranges>
#include <range/v3/view/take.hpp>
#include <range/v3/view/drop.hpp>
#include <range/v3/view/concat.hpp>
int main() {
const auto my_range = { 10, 20, 30, 40, 50, 60, 70 };
auto index_to_drop = 3; // so drop the 40
auto earlier = my_range | ranges::views::take(index_to_drop);
auto later = my_range | ranges::views::drop(index_to_drop + 1);
auto both = ranges::views::concat(earlier, later);
for (auto const & num : both) { std::cout << num << ' '; }
}
This will produce:
10 20 30 50 60 70
... without the 40.
See it working on Godbolt. Compilation time is extremely poor though. Also, concat() is not part of C++20. Maybe in C++23?
The most straightforward way I can think of in range-v3 would be:
auto remove_at_index(size_t idx) {
namespace rv = ranges::views;
return rv::enumerate
| rv::filter([=](auto&& pair){ return pair.first != idx; })
| rv::values;
}
To be used like:
my_range | remove_at_index(3);
enumerate (and its more general cousin zip) is not in C++20, but will hopefully be in C++23.

Understanding composition of lazy range-based functions

TL;DR
What's wrong with the last commented block of lines below?
// headers and definitions are further down in the question
int main() {
std::vector<int> v{10,20,30};
using type_of_temp = std::vector<std::pair<std::vector<int>,int>>;
// seems to work, I think it does work
auto temp = copy_range<type_of_temp>(v | indexed(0)
| transformed(complex_keep_index));
auto w = temp | transformed(distribute);
print(w);
// shows undefined behavior
//auto z = v | indexed(0)
// | transformed(complex_keep_index)
// | transformed(distribute);
//print(z);
}
Or, in other words, what makes piping v | indexed(0) into transformed(complex_keep_index) well defined, but piping v | indexed(0) | transformed(complex_keep_index) into transformed(distribute) undefined behavior?
Extended version
I have a container of things,
std::vector<int> v{10,20,30};
and I have a function which generates another container from each of those things,
// this is in general a computation of type
// T -> std::vector<U>
constexpr auto complex_comput = [](auto const& x){
return std::vector{x,x+1,x+2}; // in general the number of elements changes
};
so if I was to apply the complex_comput to v, I'd get,
{{10, 11, 12}, {20, 21, 22}, {30, 31, 32}}
and if I was to also concatenate the results, I'd finally get this:
{10, 11, 12, 20, 21, 22, 30, 31, 32}
However, I want to keep track of the index where each number came from, in a way that the result would encode something like this:
0 10
0 11
0 12
1 20
1 21
1 22
2 30
2 31
2 32
To accomplish this, I (eventually) came up with this solution, where I attempted to make use of ranges from Boost. Specifically I do the following:
use boost::adaptors::indexed to attach the index to each element of v
transform each resulting "pair" into a std::pair storing the index and the result of applying complex_comput to the value,
and finally transform each std::pair<std::vector<int>,int> into a std::vector<std::pair<int,int>>.
However, I had to give up on using a range between steps 2 and 3, using a helper "true" std::vector in between the two transformations.
#include <boost/range/adaptor/indexed.hpp>
#include <boost/range/adaptor/transformed.hpp>
#include <boost/range/iterator_range_core.hpp>
#include <iostream>
#include <utility>
#include <vector>
using boost::adaptors::indexed;
using boost::adaptors::transformed;
using boost::copy_range;
constexpr auto complex_comput = [](auto const& x){
// this is in general a computation of type
// T -> std::vector<U>
return std::vector{x,x+1,x+2};
};
constexpr auto complex_keep_index = [](auto const& x){
return std::make_pair(complex_comput(x.value()), x.index());
};
constexpr auto distribute = [](auto const& pair){
return pair.first | transformed([n = pair.second](auto x){
return std::make_pair(x, n);
});
};
template<typename T>
void print(T const& w) {
for (auto const& elem : w) {
for (auto i : elem) {
std::cout << i.second << ':' << i.first << ' ';
}
std::cout << std::endl;
}
}
int main() {
std::vector<int> v{10,20,30};
using type_of_temp = std::vector<std::pair<std::vector<int>,int>>;
auto temp = copy_range<type_of_temp>(v | indexed(0)
| transformed(complex_keep_index));
auto w = temp | transformed(distribute);
print(w);
//auto z = v | indexed(0)
// | transformed(complex_keep_index)
// | transformed(distribute);
//print(z);
}
Indeed, uncommenting the lines defining and using z gives you code that compiles but generates rubbish results, i.e. undefined behavior. Note that applying copy_range<type_of_temp> to the first, working, range is necessary; otherwise the resulting code is essentially the same as the one on the right of auto z =.
Why do I have to do so? What are the details that makes the oneliner not work?
I partly understand the reason, and I'll list my understanding/thoughts in the following, but I'm asking this question to get a thorough explanation of all the details of this.
I understand that the undefined behavior I observe stems from z being a range that defines a view onto some temporary which has been destroyed;
given the working version of the code, it is apparent that temporary is v | indexed(0) | transformed(complex_keep_index);
however, isn't v | indexed(0) itself a temporary that is fed to transformed(complex_keep_index)?
Probably one important detail is that the expression v | indexed(0) is no more than a lazy range: it evaluates nothing, and just sets things up so that the computation is done when one iterates over the range; after all, I can easily do v | indexed(0) | indexed(0) | indexed(0), which is well defined;
and also the whole v | indexed(0) | transformed(complex_keep_index) is well defined, otherwise the code above using w would probably misbehave (I know that UB doesn't mean that the result has to show something is wrong, and things could just look ok on this hardware, in this moment, and break tomorrow).
So there's something inherently wrong in passing an rvalue to transformed(distribute);
but what's wrong in doing so lies in distribute, not in transformed, because, for instance, changing distribute to [](auto x){ return x; } seems to be well defined.
So what's wrong with distribute? Here's the code
constexpr auto distribute = [](auto const& pair){
return pair.first | transformed([n = pair.second](auto x){
return std::make_pair(x, n);
});
};
What's the problem with it? The returned range (the output of this transformed) will hold iterators/pointers/references to pair.first, which is part of pair and goes out of scope when distribute returns; but pair is a reference to something in the caller, which keeps living, right?
However, I know that even though a const reference (e.g. pair) can keep a temporary (e.g. an element of v | indexed(0) | transformed(complex_keep_index)) alive, that doesn't mean the temporary stays alive when that reference goes out of scope, just because it is in turn referenced by something else (the references/pointers/iterators in the output of transformed([n = …](…){ … })) that doesn't go out of scope.
I think/hope that probably the answer is already in what I've written above, however I need some help to streamline all of that so that I can understand it once and for all.
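A minimal non-Boost analogy of the distribute problem may help pin the reasoning down: a function that returns a lazy computation referring to data it does not own. first_of_ref and first_of_copy are made-up names for illustration:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Returns a closure that only *references* its argument, like the range
// that distribute returns referencing pair.first.
std::function<int()> first_of_ref(const std::vector<int>& v) {
    return [&v] { return v.front(); };
}

// Returns a closure that owns a copy, like the copy_range workaround.
std::function<int()> first_of_copy(std::vector<int> v) {
    return [v = std::move(v)] { return v.front(); };
}

// auto f = first_of_ref(std::vector<int>{1, 2, 3});
// f();  // UB: the temporary vector died at the end of the statement above
```

The lazy range returned by distribute is in the same position as first_of_ref's closure: it refers to pair.first, and nothing extends the lifetime of the temporary element that pair was bound to once the full expression ends.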

Why is (n += 2 * i * i) faster than (n+= i) in C++?

This C++11 program takes on average between 7.42s and 7.79s to run.
#include <iostream>
#include <chrono>
using namespace std;
using c = chrono::system_clock;
using s = chrono::duration<double>;
void func(){
int n=0;
const auto before = c::now();
for(int i=0; i<2000000000; i++){
n += i;
}
const s duration = c::now() - before;
cout << duration.count();
}
If I replace n += i with n += 2 * i * i, it takes between 5.80s and 5.96s. How come?
I ran each version of the program 20 times, alternating between the two. Here are the results:
n += i | n += 2 * i * i
---------+----------------
7.77047 | 5.87978
7.69226 | 5.83551
7.77375 | 5.84888
7.73748 | 5.84629
7.72988 | 5.84356
7.69736 | 5.83784
7.72597 | 5.84246
7.72722 | 5.81678
7.73291 | 5.81237
7.71871 | 5.81016
7.7478 | 5.80119
7.64906 | 5.80058
7.7253 | 5.9078
7.42734 | 5.96399
7.72573 | 5.84733
7.65591 | 5.81793
7.76619 | 5.83116
7.76963 | 5.84424
7.79928 | 5.87078
7.79274 | 5.84689
I compiled it with GCC 9.1.1 20190503 (Red Hat 9.1.1-1), with no optimization flags:
g++ -std=c++11
We know that the maximum int is ~2 billion. So, when i ~ 32000, can we say that the compiler predicts that the calculation 2 * i * i will overflow?
https://godbolt.org/z/B3zIsv
You'll notice that with -O2, the code used to calculate n is removed completely. So the real questions should be:
Why are you profiling code without -O2?
Why are you profiling code that has no observable side effects? (n can be removed completely; printing the value of n at the end would be more useful here)
Why are you not profiling code in a profiler?
The timing results you have come from a deeply flawed methodology.

Overflowed value prints non-overflowed value on Arduino

First of all, I know the problem with the code and how to get it to work. I'm mainly looking for an explanation why my output is what it is.
The following piece of code replicates the behaviour:
void setup() {
Serial.begin(9600);
}
void loop() {
Serial.println("start");
for(int i = 0; i < 70000; i++) {
if((i % 2000) == 0)
Serial.println(i);
}
}
Obviously the for loop will run forever because i will overflow at 32,767. I would expect it to overflow and print -32000.
Expected| Actually printed
0 | 0
2000 | 2000
4000 | 4000
... | ...
30000 | 30000
32000 | 32000
-32000 | 33536
-30000 | 35536
It looks like it prints the actual iteration count, since if you overflow and count up to -32000 you would have done 33536 iterations, but I can't figure out how it's able to print the 33536.
The same thing happens every 65536 iterations:
95536 | 161072 | 226608 | 292144 | 357680
97536 | 163072 | 228608 | 294144 | 359680
99072 | 164608 | 230144 | 295680 | 361216
101072 | 166608 | 232144 | 297680 | 363216
EDIT
When I change the loop to add 10,000 every iteration and only print every 1,000,000 to speed it up, the Arduino crashes (or at least, the prints stop) at 2,147,000,000. This seems to support svtag's 32-bit idea.
EDIT2
The edits I made for the 2,147,000,000 check:
void loop() {
Serial.println("start");
for(int i = 0; i < 70000; i+=10000) {
if((i % 1000000) == 0)
Serial.println(i);
}
}
They follow the same trend as the previous example, printing 0, 1000000, 2000000, ...
However, when I updated the AVR package from 1.6.17 to the latest 1.6.23, I get only 0's. The original example (% 2000 and i++) still gives the same output.
The compiler may have automatically cast i to unsigned int when you are doing the println, or the value is going to the wrong overload. Try using Serial.println((int)i);

Running std::normal_distribution with user-defined random generator

I am about to generate an array of normally distributed pseudo-random numbers. As far as I know, the standard library offers the following code for that:
std::random_device rd;
std::mt19937 gen(rd());
std::normal_distribution<> d(mean,std);
...
double number = d(gen);
The problem is that I want to use a Sobol' quasi-random sequence instead of the Mersenne Twister pseudo-random generator. So, my question is:
Is it possible to run the std::normal_distribution with a user-defined random generator (with a Sobol' quasi-random sequence generator in my case)?
More details: I have a class called RandomGenerators, which is used to generate Sobol' quasi-random numbers:
RandomGenerator randgen;
double number = randgen.sobol(0,1);
Yes, it is possible. Just make it comply with the requirements of a uniform random number generator (§26.5.1.3 paragraphs 2 and 3):
2 A class G satisfies the requirements of a uniform random number
generator if the expressions shown in Table 116 are valid and have the
indicated semantics, and if G also satisfies all other requirements
of this section. In that Table and throughout this section:
a) T is the type named by G's associated result_type, and
b) g is a value of G.
Table 116 — Uniform random number generator requirements
Expression | Return type | Pre/post-condition | Complexity
----------------------------------------------------------------------
G::result_type | T | T is an unsigned integer | compile-time
| | type (§3.9.1). |
----------------------------------------------------------------------
g() | T | Returns a value in the | amortized constant
| | closed interval |
| | [G::min(), G::max()]. |
----------------------------------------------------------------------
G::min() | T | Denotes the least value | compile-time
| | potentially returned by |
| | operator(). |
----------------------------------------------------------------------
G::max() | T | Denotes the greatest value | compile-time
| | potentially returned by |
| | operator(). |
3 The following relation shall hold: G::min() < G::max().
A word of caution here - I came across a big gotcha when I implemented this. It seems that if the return type of max()/min()/operator() is not 64-bit, then the distribution will resample. My (unsigned) 32-bit Sobol implementation was getting sampled twice per deviate, thus destroying the properties of the numbers. This code reproduces the issue:
#include <random>
#include <limits>
#include <iostream>
#include <cstdint>
typedef uint32_t rng_int_t;
int requested = 0;
int sampled = 0;
struct Quasi
{
rng_int_t operator()()
{
++sampled;
return 0;
}
rng_int_t min() const
{
return 0;
}
rng_int_t max() const
{
return std::numeric_limits<rng_int_t>::max();
}
};
int main()
{
std::uniform_real_distribution<double> dist(0.0,1.0);
Quasi q;
double total = 0.0;
for (size_t i = 0; i < 10; ++i)
{
dist(q);
++requested;
}
std::cout << "requested: " << requested << std::endl;
std::cout << "sampled: " << sampled << std::endl;
}
Output (using g++ 5.4):
requested: 10
sampled: 20
and even when compiled with -m32. If you change rng_int_t to 64-bit, the problem goes away. My workaround is to stick the 32-bit value into the most significant bits of the return value, e.g.
return uint64_t(val) << 32;
You can now generate Sobol sequences directly with Boost. See boost/random/sobol.hpp.