Why would a parallel version of accumulate be so much slower? (C++)

Inspired by Anthony Williams' "C++ Concurrency in Action", I took a closer look at his parallel version of std::accumulate. I copied the code from the book, added some output for debugging purposes, and this is what I ended up with:
#include <algorithm>
#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <numeric>
#include <random>
#include <thread>
#include <vector>
template <typename Iterator, typename T>
struct accumulate_block
{
T operator()(Iterator first, Iterator last)
{
return std::accumulate(first, last, T());
}
};
template <typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T init)
{
const unsigned long length = std::distance(first, last);
if (!length) return init;
const unsigned long min_per_thread = 25;
const unsigned long max_threads = (length) / min_per_thread;
const unsigned long hardware_conc = std::thread::hardware_concurrency();
const unsigned long num_threads = std::min(hardware_conc != 0 ? hardware_conc : 2, max_threads);
const unsigned long block_size = length / num_threads;
std::vector<std::future<T>> futures(num_threads - 1);
std::vector<std::thread> threads(num_threads - 1);
Iterator block_start = first;
for (unsigned long i = 0; i < (num_threads - 1); ++i)
{
Iterator block_end = block_start;
std::advance(block_end, block_size);
std::packaged_task<T(Iterator, Iterator)> task{accumulate_block<Iterator, T>()};
futures[i] = task.get_future();
threads[i] = std::thread(std::move(task), block_start, block_end);
block_start = block_end;
}
T last_result = accumulate_block<Iterator, T>()(block_start, last);
for (auto& t : threads) t.join();
T result = init;
for (unsigned long i = 0; i < (num_threads - 1); ++i) {
result += futures[i].get();
}
result += last_result;
return result;
}
template <typename TimeT = std::chrono::microseconds>
struct measure
{
template <typename F, typename... Args>
static typename TimeT::rep execution(F func, Args&&... args)
{
using namespace std::chrono;
auto start = system_clock::now();
func(std::forward<Args>(args)...);
auto duration = duration_cast<TimeT>(system_clock::now() - start);
return duration.count();
}
};
template <typename T>
T parallel(const std::vector<T>& v)
{
return parallel_accumulate(v.begin(), v.end(), 0);
}
template <typename T>
T stdaccumulate(const std::vector<T>& v)
{
return std::accumulate(v.begin(), v.end(), 0);
}
int main()
{
constexpr unsigned int COUNT = 200000000;
std::vector<int> v(COUNT);
// optional randomising vector contents - std::accumulate also gives 0us
// but custom parallel accumulate gives longer times with randomised input
std::mt19937 mersenne_engine;
std::uniform_int_distribution<int> dist(1, 100);
auto gen = std::bind(dist, mersenne_engine);
std::generate(v.begin(), v.end(), gen);
std::fill(v.begin(), v.end(), 1);
auto v2 = v; // copy to work on the same data
std::cout << "starting ... " << '\n';
std::cout << "std::accumulate : \t" << measure<>::execution(stdaccumulate<int>, v) << "us" << '\n';
std::cout << "parallel: \t" << measure<>::execution(parallel<int>, v2) << "us" << '\n';
}
What is most interesting here is that I almost always get a measured time of 0 from std::accumulate.
Example output:
starting ...
std::accumulate : 0us
parallel:
inside1 54us
inside2 81830us
inside3 89082us
89770us
What is the problem here?
http://cpp.sh/6jbt

As is the usual case with micro-benchmarking, you need to make sure that your code is actually doing something observable. You're doing an accumulate, but you're not storing the result anywhere or doing anything with it, so did any of the work really need to be done at all? The compiler simply snipped out all that logic in the normal case. That's why you get 0.
Just change your code to actually ensure that work needs to be done. For example:
int s, s2;
std::cout << "starting ... " << '\n';
std::cout << "std::accumulate : \t"
<< measure<>::execution([&]{s = std::accumulate(v.begin(), v.end(), 0);})
<< "us\n";
std::cout << "parallel: \t"
<< measure<>::execution([&]{s2 = parallel_accumulate(v2.begin(), v2.end(), 0);})
<< "us\n";
std::cout << s << ',' << s2 << std::endl;
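If you would rather not print the sums, another common way to keep the optimizer from throwing the work away (a sketch; the name sink is just illustrative) is to write each result to a volatile variable, since writes to volatile objects count as observable behaviour:
volatile int sink = 0; // the compiler must assume these writes are observable
std::cout << "std::accumulate : \t"
          << measure<>::execution([&]{ sink = std::accumulate(v.begin(), v.end(), 0); })
          << "us\n";
std::cout << "parallel: \t"
          << measure<>::execution([&]{ sink = parallel_accumulate(v2.begin(), v2.end(), 0); })
          << "us\n";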

Related

thread does not update referenced variable

#include <iostream>
#include <vector>
#include <cstdlib>
#include <thread>
#include <array>
#include <iterator>
#include <algorithm>
#include <functional>
#include <numeric> // accumulate
int sum_of_digits(int num, int sum = 0) {
//calculate sum of digits recursively till the sum is single digit
if (num == 0) {
if (sum / 10 == 0)
return sum;
return sum_of_digits(sum);
}
return sum_of_digits(num / 10, sum + num % 10);
}
template<typename T, typename Iterator>
int sum_of_digits(Iterator begin, Iterator end) {
//not for array temp workout
T copy(std::distance(begin,end));
std::copy(begin, end, copy.begin());
std::for_each(copy.begin(), copy.end(), [](int& i) {i = sum_of_digits(i); });
return sum_of_digits(std::accumulate(begin, end, 0));
}
template<typename T, typename Iterator>
void sum_of_digits(Iterator begin, Iterator end, int& sum) {
sum = sum_of_digits<T, Iterator>(begin, end);
}
template<typename T>
int series_computation(const T& container) {
return sum_of_digits<T, typename T::const_iterator>(container.begin(), container.end());
}
#define MIN_THREAD 4
#define DEBUGG
template<typename T>
int parallel_computation(const T& container, const int min_el) {
if (container.size() < min_el) {
#ifdef DEBUGG
std::cout << "no multithreading" << std::endl;
#endif
return series_computation<T>(container);
}
const unsigned int total_el = container.size();
const unsigned int assumed_thread = total_el / min_el;
const unsigned hardwarethread = std::thread::hardware_concurrency();
const unsigned int thread_count = std::min<unsigned int>(hardwarethread == 0 ? MIN_THREAD : hardwarethread, assumed_thread) - 1;//one thread is main thread
const unsigned int el_per_thread = total_el / thread_count;
#ifdef DEBUGG
std::cout << "thread count: " << thread_count << " element per thread: " << el_per_thread << std::endl;
#endif
std::vector<std::thread> threads;
threads.reserve(thread_count);
using result_type = std::vector<int>;
result_type results(thread_count);
results.reserve(thread_count);
auto it_start = container.begin();
for (int i = 0; i < thread_count; i++) {
auto it_end = it_start;
std::advance(it_end, el_per_thread);
threads.push_back(std::thread([&]() {sum_of_digits<T, typename T::const_iterator>(it_start, it_end, std::ref(results[i])); }));
//threads.push_back( std::thread{ sum_of_digits<T,typename T::const_iterator>, it_start, it_end, std::ref(results[i]) });
it_start = it_end;
std::cout << "iterator " << i << std::endl;
}
results[thread_count - 1] = sum_of_digits<T, typename T::const_iterator>(it_start, container.end());
std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
return series_computation<T>(results);
}
#define SIZE 1000
int main() {
std::vector<int> array(SIZE);
for (auto& curr : array) {
curr = std::rand();
}
for (auto& curr : array) {
//std::cout <<curr<< std::endl;
}
int series_val = series_computation<std::vector<int>>(array);
std::cout << "series val: " << series_val << std::endl;
int parallel_val = parallel_computation<std::vector<int>>(array, 25);
std::cout << "parallel val: " << parallel_val << std::endl;
return 0;
}
I am trying to calculate the (recursive) sum of digits of a randomly generated vector using std::thread, but only the last entry of the results vector (i.e. the one computed by the main thread) is stored; the child threads do not update the referenced results[i].
What is causing this behaviour?
Also, inside the for loop of parallel_computation, this code works:
threads.push_back(std::thread([&]() {sum_of_digits<T, typename T::const_iterator>(it_start, it_end, std::ref(results[i])); }));
but this does not, and the error is: Error: '<function-style-cast>': cannot convert from 'initializer list' to 'std::thread'.
What's wrong with the one below?
threads.push_back( std::thread{ sum_of_digits<T,typename T::const_iterator>, it_start, it_end, std::ref(results[i]) });
Your code has a data race here:
threads.push_back(std::thread([&]() {sum_of_digits<T, typename T::const_iterator>(it_start, it_end, std::ref(results[i])); }));
it_start = it_end;
The lambda uses it_start, and then, without any synchronization, you immediately modify it in the main thread. Capture it_start, it_end and i by copy instead. Also, std::ref is pointless there: you are not passing a reference to the std::thread constructor, but to a normal direct function call. And, as I mentioned in the comments, the iterator template argument in the call can be deduced:
threads.push_back(std::thread([it_start, it_end, i, &results]() {
    sum_of_digits<T>(it_start, it_end, results[i]); }));
it_start = it_end;
threads.push_back( std::thread{ sum_of_digits<T,typename T::const_iterator>, it_start, it_end, std::ref(results[i]) });
fails because there are two function templates of which sum_of_digits<T, typename T::const_iterator> could name a specialization. You cannot pass an overload set to a function, and since there is no way to deduce which of the two you mean, it fails.
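If you do want to pass the function itself to std::thread rather than wrap it in a lambda, one option (a sketch, not the only way) is to pick the overload explicitly by casting to the exact function pointer type. Because std::thread copies its arguments, the iterators and the index are then no longer racy either:
using Iter = typename T::const_iterator;
// select the three-argument overload out of the overload set
auto fn = static_cast<void (*)(Iter, Iter, int&)>(&sum_of_digits<T, Iter>);
threads.push_back(std::thread(fn, it_start, it_end, std::ref(results[i])));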
There are some smaller issues in the code, e.g. the pointless reserve calls right after the vectors have already been constructed/resized, or
std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
which is not guaranteed to work, because taking the address of a member function of a standard library class has unspecified behavior. Instead just use a loop:
for(auto& t : threads) {
t.join();
}
and possibly some others.

Is there a way of making a copy of the vector in the template?

I'm trying to make a copy of the vector v.
I need to make the copy inside the template (at the spot marked by the comment) because I'm testing the running time of sort in this benchmark.
To get a correct measurement of the sort's running time, I need to make a copy, because the running time on an already-sorted vector is different from that on the unsorted vector.
template <typename TFunc> void RunAndMeasure(const char* title, const int repeat, TFunc func)
{
for(int i = 0; i < repeat; ++i) {
//makecopyhere
const auto start = chrono::steady_clock::now();
func();
const auto end = chrono::steady_clock::now();
cout << title << ": " << chrono::duration <double, std::milli>(end - start).count() << " ms" << "\n";
}
cout<<"\n"<<endl;
}
int main()
{
long int samples=5000000;
constexpr int repeat{10};
random_device rd;
mt19937_64 mre(rd());
uniform_int_distribution<int> urd(1, 10);
vector<int> v(samples);
for(auto &e : v) {
e = urd(mre);
}
RunAndMeasure("std::warm up", repeat, [&v] {
vector <int> path= v;
sort(execution::par, path.begin(), path.end());
});
return 0;
}
Sorry for my poor English.
Thank you.
I want to remove the copy made in the function (vector path) and make the copy in the template instead, so that only the sort operation is measured.
So, something like
template <typename TFunc, typename TArg> void RunAndMeasure(const char* title, const int repeat, TFunc func, TArg arg)
{
for(int i = 0; i < repeat; ++i) {
TArg copy = arg;
const auto start = chrono::steady_clock::now();
func(copy);
const auto end = chrono::steady_clock::now();
cout << title << ": " << chrono::duration <double, std::milli>(end - start).count() << " ms" << "\n";
}
cout<<"\n"<<endl;
}
int main()
{
long int samples=5000000;
constexpr int repeat{10};
random_device rd;
mt19937_64 mre(rd());
uniform_int_distribution<int> urd(1, 10);
vector<int> v(samples);
for(auto &e : v) {
e = urd(mre);
}
RunAndMeasure("std::warm up", repeat, [](vector<int> & path) {
sort(execution::par, path.begin(), path.end());
}, v);
return 0;
}
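One small refinement (a sketch, not part of the answer above): take the input by const reference, so that RunAndMeasure itself does not make an extra by-value copy of the whole vector on top of the per-iteration one. Only the untimed copy line touches the data before the clock starts:
template <typename TFunc, typename TArg>
void RunAndMeasure(const char* title, const int repeat, TFunc func, const TArg& arg)
{
    for (int i = 0; i < repeat; ++i) {
        TArg copy = arg; // untimed copy of the pristine input
        const auto start = chrono::steady_clock::now();
        func(copy);      // only the work on the fresh copy is timed
        const auto end = chrono::steady_clock::now();
        cout << title << ": " << chrono::duration<double, std::milli>(end - start).count() << " ms" << "\n";
    }
    cout << "\n" << endl;
}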

Why does my program run faster on 1 thread than on 8? (C++)

Please look at this code:
#include <iostream>
#include <thread>
#include <numeric>
#include <algorithm>
#include <vector>
#include <chrono>
template<typename Iterator, typename T>
struct accumulate_block
{
void operator()(Iterator begin, Iterator end, T& result)
{
result = std::accumulate(begin, end, result);
}
};
template<typename Iterator, typename T>
int accumulate_all(Iterator begin, Iterator end, T& init)
{
auto numOfThreads = std::thread::hardware_concurrency();
std::vector<std::thread> threads(numOfThreads);
auto step = std::distance(begin, end) / numOfThreads;
std::vector<int> results(numOfThreads,0);
for(int i=0; i<numOfThreads-1; ++i)
{
auto block_end = begin;
std::advance(block_end, step);
threads[i] = std::thread(accumulate_block<Iterator, T>(), begin, block_end, std::ref(results[i]));
begin = block_end;
}
threads[numOfThreads-1] = std::thread(accumulate_block<Iterator, T>(), begin, end, std::ref(results[numOfThreads-1]));
for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
return accumulate(results.begin(), results.end(), 0);
}
int main()
{
int x=0;
std::vector<int> V(20000000,1);
auto t1 = std::chrono::high_resolution_clock::now();
//std::accumulate(std::begin(V), std::end(V), x); // single-threaded option
std::cout<<accumulate_all(std::begin(V), std::end(V), x);
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "process took: "
<< std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
<< " nanoseconds\n";
return 0;
}
When I run the concurrent version (on 8 threads, because std::thread::hardware_concurrency() returns 8 on my machine),
the output is: process took: 8895404 nanoseconds.
But the single-threaded option's output is: process took: 124 nanoseconds.
Can anyone explain this strange behaviour?
The compiler removes the call to std::accumulate because it does not have side effects and the result is not used.
Fix:
auto sum = std::accumulate(std::begin(V), std::end(V), x); // single-threaded option
// At the very end.
std::cout << sum << '\n';

Type safe index values for std::vector

I have classes that collect index values from different constant STL vectors. The problem is that even though these vectors are different in content and serve different purposes, their indices are all of type std::size_t, so one might erroneously use the index stored for one vector to access the elements of another vector. Can the code be changed so that using an index with the wrong vector is a compile-time error?
A code example:
#include <iostream>
#include <string>
#include <vector>
struct Named
{
std::string name;
};
struct Cat : Named { };
struct Dog : Named { };
struct Range
{
std::size_t start;
std::size_t end;
};
struct AnimalHouse
{
std::vector< Cat > cats;
std::vector< Dog > dogs;
};
int main( )
{
AnimalHouse house;
Range cat_with_name_starting_with_a;
Range dogs_with_name_starting_with_b;
// ...some initialization code here...
for( auto i = cat_with_name_starting_with_a.start;
i < cat_with_name_starting_with_a.end;
++i )
{
std::cout << house.cats[ i ].name << std::endl;
}
for( auto i = dogs_with_name_starting_with_b.start;
i < dogs_with_name_starting_with_b.end;
++i )
{
// bad copy paste but no compilation error
std::cout << house.cats[ i ].name << std::endl;
}
return 0;
}
Disclaimer: please do not focus too much on the example itself, I know it is dumb, it is just to get the idea.
Here is an attempt following up on my comment.
There is of course a lot of room to change the details of how this would work depending on the use case; this way seemed reasonable to me.
#include <iostream>
#include <vector>
template <typename T>
struct Range {
Range(T& vec, std::size_t start, std::size_t end) :
m_vector(vec),
m_start(start),
m_end(end),
m_size(end-start+1) {}
auto begin() {
auto it = m_vector.begin();
std::advance(it, m_start);
return it;
}
auto end() {
auto it = m_vector.begin();
std::advance(it, m_end + 1);
return it;
}
std::size_t size() {
return m_size;
}
void update(std::size_t start, std::size_t end) {
m_start = start;
m_end = end;
m_size = end - start + 1;
}
Range copy(T& other_vec) {
return Range(other_vec, m_start, m_end);
}
typename T::reference operator[](std::size_t index) {
return m_vector[m_start + index];
}
private:
T& m_vector;
std::size_t m_start, m_end, m_size;
};
// This can be used if c++17 is not supported, to avoid
// having to specify template parameters
template <typename T>
Range<T> make_range(T& t, std::size_t start, std::size_t end) {
return Range<T>(t, start, end);
}
int main() {
std::vector<int> v1 {1, 2, 3, 4, 5};
std::vector<double> v2 {0.5, 1., 1.5, 2., 2.5};
Range more_then_2(v1, 1, 4); // Only works in c++17 or later
auto more_then_1 = make_range(v2, 2, 4);
for (auto v : more_then_2)
std::cout << v << ' ';
std::cout << std::endl;
for (auto v : more_then_1)
std::cout << v << ' ';
std::cout << std::endl;
more_then_2.update(2,4);
for (auto v : more_then_2)
std::cout << v << ' ';
std::cout << std::endl;
auto v3 = v1;
auto more_then_2_copy = more_then_2.copy(v3);
for (unsigned i=0; i < more_then_2_copy.size(); ++i)
std::cout << more_then_2_copy[i] << ' ';
return 0;
}
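A different direction, sketched below (not part of the answer above; Index, TaggedVector and the tag types are illustrative names), is to keep plain indices but give each container its own index type via a tag parameter, so that subscripting with the wrong kind of index fails to compile:
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>
template <typename Tag>
struct Index { std::size_t value; };
// A vector that can only be subscripted with its own Index<Tag>.
template <typename T, typename Tag>
struct TaggedVector : std::vector<T>
{
    using std::vector<T>::vector;
    T& operator[](Index<Tag> i) { return std::vector<T>::operator[](i.value); }
    const T& operator[](Index<Tag> i) const { return std::vector<T>::operator[](i.value); }
};
struct Cat { std::string name; };
struct Dog { std::string name; };
struct CatTag {};
struct DogTag {};
int main()
{
    TaggedVector<Cat, CatTag> cats{ {"Alice"}, {"Archibald"} };
    TaggedVector<Dog, DogTag> dogs{ {"Bruno"} };
    Index<CatTag> i{0};
    std::cout << cats[i].name << std::endl; // fine
    // std::cout << dogs[i].name;           // compile-time error: wrong index type
    return 0;
}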

What is the optimal way to concatenate two vectors whilst transforming elements of one vector?

Suppose I have
std::vector<T1> vec1 {/* filled with T1's */};
std::vector<T2> vec2 {/* filled with T2's */};
and some function T1 f(T2) which could of course be a lambda. What is the optimal way to concatenate vec1 and vec2 whilst applying f to each T2 in vec2?
The apparently obvious solution is std::transform, i.e.
vec1.reserve(vec1.size() + vec2.size());
std::transform(vec2.begin(), vec2.end(), std::back_inserter(vec1), f);
but I say this is not optimal as std::back_inserter must make an unnecessary capacity check on each inserted element. What would be optimal is something like
vec1.insert(vec1.end(), vec2.begin(), vec2.end(), f);
which could get away with a single capacity check. Sadly this is not valid C++. Essentially this is the same reason why std::vector::insert is optimal for vector concatenation, see this question and the comments in this question for further discussion on this point.
So:
Is std::transform the optimal method using the STL?
If so, can we do better?
Is there a good reason why the insert function described above was left out of the STL?
UPDATE
I've had a go at verifying if the multiple capacity checks do have any noticeable cost. To do this I basically just pass the id function (f(x) = x) to the std::transform and push_back methods discussed in the answers. The full code is:
#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <cstdint>
#include <chrono>
#include <numeric>
#include <random>
using std::size_t;
std::vector<int> generate_random_ints(size_t n)
{
std::default_random_engine generator;
auto seed1 = std::chrono::system_clock::now().time_since_epoch().count();
generator.seed((unsigned) seed1);
std::uniform_int_distribution<int> uniform {};
std::vector<int> v(n);
std::generate_n(v.begin(), n, [&] () { return uniform(generator); });
return v;
}
template <typename D=std::chrono::nanoseconds, typename F>
D benchmark(F f, unsigned num_tests)
{
D total {0};
for (unsigned i = 0; i < num_tests; ++i) {
auto start = std::chrono::system_clock::now();
f();
auto end = std::chrono::system_clock::now();
total += std::chrono::duration_cast<D>(end - start);
}
return D {total / num_tests};
}
template <typename T>
void std_insert(std::vector<T> vec1, const std::vector<T> &vec2)
{
vec1.insert(vec1.end(), vec2.begin(), vec2.end());
}
template <typename T1, typename T2, typename UnaryOperation>
void push_back_concat(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
vec1.reserve(vec1.size() + vec2.size());
for (const auto& x : vec2) {
vec1.push_back(op(x));
}
}
template <typename T1, typename T2, typename UnaryOperation>
void transform_concat(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
vec1.reserve(vec1.size() + vec2.size());
std::transform(vec2.begin(), vec2.end(), std::back_inserter(vec1), op);
}
int main(int argc, char **argv)
{
unsigned num_tests {1000};
size_t vec1_size {10000000};
size_t vec2_size {10000000};
auto vec1 = generate_random_ints(vec1_size);
auto vec2 = generate_random_ints(vec2_size);
auto f_std_insert = [&vec1, &vec2] () {
std_insert(vec1, vec2);
};
auto f_push_back_id = [&vec1, &vec2] () {
push_back_concat(vec1, vec2, [] (int i) { return i; });
};
auto f_transform_id = [&vec1, &vec2] () {
transform_concat(vec1, vec2, [] (int i) { return i; });
};
auto std_insert_time = benchmark<std::chrono::milliseconds>(f_std_insert, num_tests).count();
auto push_back_id_time = benchmark<std::chrono::milliseconds>(f_push_back_id, num_tests).count();
auto transform_id_time = benchmark<std::chrono::milliseconds>(f_transform_id, num_tests).count();
std::cout << "std_insert: " << std_insert_time << "ms" << std::endl;
std::cout << "push_back_id: " << push_back_id_time << "ms" << std::endl;
std::cout << "transform_id: " << transform_id_time << "ms" << std::endl;
return 0;
}
Compiled with:
g++ vector_insert_demo.cpp -std=c++11 -O3 -o vector_insert_demo
Output:
std_insert: 44ms
push_back_id: 61ms
transform_id: 61ms
The compiler will have inlined the lambda, so that cost can safely be discounted. Unless anyone else has a viable explanation for these results (or is willing to check the assembly), I think it's reasonable to conclude that there is a noticeable cost to the multiple capacity checks.
UPDATE: The performance difference is due to the reserve() calls, which, in libstdc++ at least, make the capacity be exactly what you request instead of using the exponential growth factor.
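That exact-capacity behaviour is implementation-defined but easy to observe directly; a small sketch (the printed numbers are whatever your standard library happens to do):
#include <iostream>
#include <vector>
int main()
{
    std::vector<int> a, b;
    a.reserve(1000); // libstdc++ makes the capacity exactly 1000
    for (int i = 0; i < 1000; ++i)
        b.push_back(i); // capacity grows geometrically, e.g. 1, 2, 4, ..., 1024
    std::cout << a.capacity() << ' ' << b.capacity() << '\n';
}
In a loop that repeatedly appends to the same vector, reserving exactly the needed size on every call forces a reallocation each time, whereas growing through insert or push_back over-allocates geometrically and skips most reallocations, which would explain the gap.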
I did some timing tests, with interesting results. Using std::vector::insert along with boost::transform_iterator was the fastest way I found by a large margin:
Version 1:
void
appendTransformed1(
std::vector<int> &vec1,
const std::vector<float> &vec2
)
{
auto v2begin = boost::make_transform_iterator(vec2.begin(),f);
auto v2end = boost::make_transform_iterator(vec2.end(),f);
vec1.insert(vec1.end(),v2begin,v2end);
}
Version 2:
void
appendTransformed2(
std::vector<int> &vec1,
const std::vector<float> &vec2
)
{
vec1.reserve(vec1.size()+vec2.size());
for (auto x : vec2) {
vec1.push_back(f(x));
}
}
Version 3:
void
appendTransformed3(
std::vector<int> &vec1,
const std::vector<float> &vec2
)
{
vec1.reserve(vec1.size()+vec2.size());
std::transform(vec2.begin(),vec2.end(),std::inserter(vec1,vec1.end()),f);
}
Timing:
Version 1: 0.59s
Version 2: 8.22s
Version 3: 8.42s
main.cpp:
#include <algorithm>
#include <cassert>
#include <chrono>
#include <iterator>
#include <iostream>
#include <random>
#include <vector>
#include "appendtransformed.hpp"
using std::cerr;
template <typename Engine>
static std::vector<int> randomInts(Engine &engine,size_t n)
{
auto distribution = std::uniform_int_distribution<int>(0,999);
auto generator = [&]{return distribution(engine);};
auto vec = std::vector<int>();
std::generate_n(std::inserter(vec,vec.end()),n,generator);
return vec;
}
template <typename Engine>
static std::vector<float> randomFloats(Engine &engine,size_t n)
{
auto distribution = std::uniform_real_distribution<float>(0,1000);
auto generator = [&]{return distribution(engine);};
auto vec = std::vector<float>();
std::generate_n(std::inserter(vec,vec.end()),n,generator);
return vec;
}
static auto
appendTransformedFunction(int version) ->
void(*)(std::vector<int>&,const std::vector<float> &)
{
switch (version) {
case 1: return appendTransformed1;
case 2: return appendTransformed2;
case 3: return appendTransformed3;
default:
cerr << "Unknown version: " << version << "\n";
exit(EXIT_FAILURE);
}
return 0;
}
int main(int argc,char **argv)
{
if (argc!=2) {
cerr << "Usage: appendtest (1|2|3)\n";
exit(EXIT_FAILURE);
}
auto version = atoi(argv[1]);
auto engine = std::default_random_engine();
auto vec1_size = 1000000u;
auto vec2_size = 1000000u;
auto count = 100;
auto vec1 = randomInts(engine,vec1_size);
auto vec2 = randomFloats(engine,vec2_size);
namespace chrono = std::chrono;
using chrono::system_clock;
auto appendTransformed = appendTransformedFunction(version);
auto start_time = system_clock::now();
for (auto i=0; i!=count; ++i) {
appendTransformed(vec1,vec2);
}
auto end_time = system_clock::now();
assert(vec1.size() == vec1_size+count*vec2_size);
auto sum = std::accumulate(vec1.begin(),vec1.end(),0u);
auto elapsed_seconds = chrono::duration<float>(end_time-start_time).count();
cerr << "Using version " << version << ":\n";
cerr << " sum=" << sum << "\n";
cerr << " elapsed: " << elapsed_seconds << "s\n";
}
Compiler: g++ 4.9.1
Options: -std=c++11 -O2
Is std::transform the optimal method using the STL?
I can't say for sure. If you reserve space, the difference should be negligible, because the check might be optimized out by either the compiler or the CPU. The only way to find out is to measure your real code.
If you don't have a particular need, you should go for std::transform.
If so, can we do better?
What you want to have:
Reduce length checks
Take advantage of move semantics when pushing back
You might also want to create a binary function, if needed.
template <typename InputIt, typename OutputIt, typename UnaryCallable>
void move_append(InputIt first, InputIt last, OutputIt firstOut, OutputIt lastOut, UnaryCallable fn)
{
    if (std::distance(first, last) > std::distance(firstOut, lastOut))
        return; // the input range does not fit into the destination range
    while (first != last && firstOut != lastOut) {
        *firstOut++ = std::move( fn(*first++) );
    }
}
a call could be:
std::vector<T1> vec1 {/* filled with T1's */};
std::vector<T2> vec2 {/* filled with T2's */};
// ...
auto old_size = vec1.size();
vec1.resize( vec1.size() + vec2.size() );
move_append( vec2.begin(), vec2.end(), vec1.begin() + old_size, vec1.end(), f );
I'm not sure you can do this with plain algorithms, because back_inserter would call Container::push_back, which checks for reallocation in any case. Also, the elements won't be able to benefit from move semantics.
Note: the safety check depends on your usage, i.e. on how you pass the ranges to append; it should probably also return a bool to signal failure.
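If the concern is only the loss of move semantics, that much can be recovered with plain algorithms (a sketch; it still pays back_inserter's per-element capacity check): wrap the source range in std::make_move_iterator so the callable receives rvalues it can move from.
#include <algorithm>
#include <iterator>
#include <vector>
template <typename T1, typename T2, typename UnaryCallable>
void append_transformed_move(std::vector<T1>& vec1, std::vector<T2>& vec2, UnaryCallable fn)
{
    vec1.reserve(vec1.size() + vec2.size());
    std::transform(std::make_move_iterator(vec2.begin()),
                   std::make_move_iterator(vec2.end()),
                   std::back_inserter(vec1), fn); // fn is called with T2&& arguments
}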
Some measurements here. I can't explain that big discrepancy.
I do not get the same results as @VaughnCato, although I do a slightly different test of std::string to int. According to my tests the push_back and std::transform methods are equally good, while the boost::transform method is slightly worse. Here is my full code:
UPDATE
I included another test case that, instead of using reserve and back_inserter, just uses resize. This is essentially the same method as in @black's answer, and also the method suggested by @ChrisDrew in the question comments. I also performed the test 'both ways', that is, std::string -> int and int -> std::string.
#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <cstdint>
#include <chrono>
#include <numeric>
#include <random>
#include <boost/iterator/transform_iterator.hpp>
using std::size_t;
std::vector<int> generate_random_ints(size_t n)
{
std::default_random_engine generator;
auto seed1 = std::chrono::system_clock::now().time_since_epoch().count();
generator.seed((unsigned) seed1);
std::uniform_int_distribution<int> uniform {};
std::vector<int> v(n);
std::generate_n(v.begin(), n, [&] () { return uniform(generator); });
return v;
}
std::vector<std::string> generate_random_strings(size_t n)
{
std::default_random_engine generator;
auto seed1 = std::chrono::system_clock::now().time_since_epoch().count();
generator.seed((unsigned) seed1);
std::uniform_int_distribution<int> uniform {};
std::vector<std::string> v(n);
std::generate_n(v.begin(), n, [&] () { return std::to_string(uniform(generator)); });
return v;
}
template <typename D=std::chrono::nanoseconds, typename F>
D benchmark(F f, unsigned num_tests)
{
D total {0};
for (unsigned i = 0; i < num_tests; ++i) {
auto start = std::chrono::system_clock::now();
f();
auto end = std::chrono::system_clock::now();
total += std::chrono::duration_cast<D>(end - start);
}
return D {total / num_tests};
}
template <typename T1, typename T2, typename UnaryOperation>
void push_back_concat(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
vec1.reserve(vec1.size() + vec2.size());
for (const auto& x : vec2) {
vec1.push_back(op(x));
}
}
template <typename T1, typename T2, typename UnaryOperation>
void transform_concat_reserve(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
vec1.reserve(vec1.size() + vec2.size());
std::transform(vec2.begin(), vec2.end(), std::back_inserter(vec1), op);
}
template <typename T1, typename T2, typename UnaryOperation>
void transform_concat_resize(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
auto vec1_size = vec1.size();
vec1.resize(vec1.size() + vec2.size());
std::transform(vec2.begin(), vec2.end(), vec1.begin() + vec1_size, op);
}
template <typename T1, typename T2, typename UnaryOperation>
void boost_transform_concat(std::vector<T1> vec1, const std::vector<T2> &vec2, UnaryOperation op)
{
auto v2_begin = boost::make_transform_iterator(vec2.begin(), op);
auto v2_end = boost::make_transform_iterator(vec2.end(), op);
vec1.insert(vec1.end(), v2_begin, v2_end);
}
int main(int argc, char **argv)
{
unsigned num_tests {1000};
size_t vec1_size {1000000};
size_t vec2_size {1000000};
// Switch the variable names to inverse test
auto vec1 = generate_random_ints(vec1_size);
auto vec2 = generate_random_strings(vec2_size);
auto op = [] (const std::string& str) { return std::stoi(str); };
//auto op = [] (int i) { return std::to_string(i); };
auto f_push_back_concat = [&vec1, &vec2, &op] () {
push_back_concat(vec1, vec2, op);
};
auto f_transform_concat_reserve = [&vec1, &vec2, &op] () {
transform_concat_reserve(vec1, vec2, op);
};
auto f_transform_concat_resize = [&vec1, &vec2, &op] () {
transform_concat_resize(vec1, vec2, op);
};
auto f_boost_transform_concat = [&vec1, &vec2, &op] () {
boost_transform_concat(vec1, vec2, op);
};
auto push_back_concat_time = benchmark<std::chrono::milliseconds>(f_push_back_concat, num_tests).count();
auto transform_concat_reserve_time = benchmark<std::chrono::milliseconds>(f_transform_concat_reserve, num_tests).count();
auto transform_concat_resize_time = benchmark<std::chrono::milliseconds>(f_transform_concat_resize, num_tests).count();
auto boost_transform_concat_time = benchmark<std::chrono::milliseconds>(f_boost_transform_concat, num_tests).count();
std::cout << "push_back: " << push_back_concat_time << "ms" << std::endl;
std::cout << "transform_reserve: " << transform_concat_reserve_time << "ms" << std::endl;
std::cout << "transform_resize: " << transform_concat_resize_time << "ms" << std::endl;
std::cout << "boost_transform: " << boost_transform_concat_time << "ms" << std::endl;
return 0;
}
Compiled using:
g++ vector_concat.cpp -std=c++11 -O3 -o vector_concat_test
The results (mean user-times) are :
| Method | std::string -> int (ms) | int -> std::string (ms) |
|:------------------------:|:-----------------------:|:-----------------------:|
| push_back | 68 | 206 |
| std::transform (reserve) | 68 | 202 |
| std::transform (resize) | 67 | 218 |
| boost::transform | 70 | 238 |
PROVISIONAL CONCLUSION
The std::transform method using resize is likely optimal (using STL) for trivial to default-construct types.
The std::transform method using reserve and back_inserter is most likely the best we can do otherwise.