I'm quite new to C++ and programming in general. To practise, I made a sorting algorithm similar to mergesort. Then I tried to make it multi-threaded.
std::future<T*> first = std::async(std::launch::async, &mergesort, temp1, temp1size);
std::future<T*> second = std::async(std::launch::async, &mergesort, temp2, temp2size);
temp1 = first.get();
temp2 = second.get();
But it seems my compiler can't decide which template to use as I get the same error twice.
Error 1 error C2783: 'std::future<result_of<enable_if<std::_Is_launch_type<_Fty>::value,_Fty>::type(_ArgTypes...)>::type> std::async(_Policy_type,_Fty &&,_ArgTypes &&...)' : could not deduce template argument for '_Fty'
Error 2 error C2784: 'std::future<result_of<enable_if<!std::_Is_launch_type<decay<_Ty>::type>::value,_Fty>::type(_ArgTypes...)>::type> std::async(_Fty &&,_ArgTypes &&...)' : could not deduce template argument for '_Fty &&' from 'std::launch'
The errors lead me to believe that std::async is overloaded with two different templates, one that takes a launch policy and one that does not, and that the compiler fails to select the correct one (I'm using Visual Studio Express 2013). So how do I tell the compiler which template to use? (Writing std::future<T*> second = std::async<std::launch::async>(&mergesort, temp2, temp2size); doesn't work; I get "invalid template argument, type expected".) And is there a better way to do this altogether?
Thanks!
You need to specify the template parameter for mergesort. It is a function template, so &mergesort on its own does not name a single function, and std::async cannot work out which instantiation you mean. An iterator-based example appears below. It also uses the current thread as a recursion point rather than burning a thread that does nothing but wait on two other threads.
I warn you, there are better ways to do this, but tuning this may suffice for your needs.
#include <iostream>
#include <algorithm>
#include <vector>
#include <thread>
#include <future>
#include <random>
#include <atomic>
static std::atomic_uint_fast64_t n_threads = ATOMIC_VAR_INIT(0);
template<typename Iter>
void mergesort(Iter begin, Iter end)
{
auto len = std::distance(begin,end);
if (len <= 16*1024) // 16K segments defer to std::sort
{
std::sort(begin,end);
return;
}
Iter mid = std::next(begin,len/2);
// start lower partition async
auto ft = std::async(std::launch::async, mergesort<Iter>, begin, mid);
++n_threads;
// use this thread for the high partition.
mergesort(mid, end);
// wait on results, then merge in-place
ft.wait();
std::inplace_merge(begin, mid, end);
}
int main()
{
std::random_device rd;
std::mt19937 rng(rd());
std::uniform_int_distribution<> dist(1,100);
std::vector<int> data;
data.reserve(1024*1024*16);
std::generate_n(std::back_inserter(data), data.capacity(),
[&](){ return dist(rng); });
mergesort(data.begin(), data.end());
std::cout << "threads: " << n_threads << '\n';
}
Output
threads: 1023
You'll have to trust me that the end vector is sorted. I'm not going to dump 16MB of values into this answer.
Notes: This was compiled and tested using clang 3.3 on a Mac and ran without issue. My gcc 4.7.2 unfortunately is brain-dead, as it tosses cookies in a shared-count abort, but I don't have high confidence in the libstdc++ or the VM on which it is housed.
I have been attempting to write a simple program to experiment with vectors of threads. I am trying to create a thread at the moment, but I am finding that I am running into an error that my constructor is not initializing properly, with the error that there is no matching constructor for std::thread matching the argument list. Here is what I have done:
#include <functional>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>
int sum = 0;
void thread_sum (auto it, auto it2, auto init) {
sum = std::accumulate(it, it2, init);
}
int main() {
// * Non Multi-Threaded
// We're going to sum up a bunch of numbers.
std::vector<int> toBeSummed;
for (int i = 0; i < 30000; ++i) {
toBeSummed.push_back(1);
}
// Initialize a sum variable
long sum = std::accumulate(toBeSummed.begin(), toBeSummed.end(), 0);
std::cout << "The sum was " << sum << std::endl;
// * Multi Threaded
// Create threads
std::vector<std::thread> threads;
std::thread t1(&thread_sum, toBeSummed.begin(), toBeSummed.end(), 0);
std::thread t2(&thread_sum, toBeSummed.begin(), toBeSummed.end(), 0);
threads.push_back(std::move(t1));
threads.push_back(std::move(t2));
return 0;
}
The line that messes up is the following:
auto t1 =
std::thread {std::accumulate, std::ref(toBeSummed.begin()),
It is an issue with the constructor. I have tried different combinations of std::ref, std::function, and other wrappers, and tried making my own function lambda object as a wrapper for accumulate.
Here is some additional information:
The error message is : atomics.cpp:28:7: error: no matching constructor for initialization of 'std::thread'
Moreover, when hovering over the constructor, it tells me that the first parameter is of <unknown_type>.
Other attempts I have tried:
Using references instead of regular value parameters
Using std::bind
Using std::function
Declaring the function in a variable and passing that as my first parameter to the constructor
Compiling with different flags, like -std=c++2a
EDIT:
I will leave the original issue as a means for others to learn from my mistakes. As the answer I accept will show, this is due to my excessive usage of auto. I had read a C++ book that basically said "always use auto, it's much more readable! Like Python and dynamic typing, but with the performance of C++," yet clearly this cannot always be done. A using alias provides the readability while keeping the type safety. Thank you for the answers!
The problems you're encountering are because std::accumulate is an overloaded function template, so the compiler doesn't know what specific function type to treat it as when passed as an argument to the thread constructor. Similar problems arise with your thread_sum function because of the auto parameters.
You can choose a specific overload/instantiation of std::accumulate as follows:
std::thread t2(
(int(*)(decltype(toBeSummed.begin()), decltype(toBeSummed.end()), int))std::accumulate,
toBeSummed.begin(), toBeSummed.end(), 0);
The problem is your excessive use of auto. You can fix it by changing this one line:
void thread_sum (auto it, auto it2, auto init) {
To this:
using Iter = std::vector<int>::const_iterator;
void thread_sum (Iter it, Iter it2, int init) {
I'm a bit of a newcomer to CUDA and thrust. I seem to be unable to get the thrust::for_each algorithm to work when supplied with a counting_iterator.
Here is my simple functor:
struct print_Functor {
print_Functor(){}
__host__ __device__
void operator()(int i)
{
printf("index %d\n", i);
}
};
Now if I call this with a host-vector prefilled with a sequence, it works fine:
thrust::host_vector<int> h_vec(10);
thrust::sequence(h_vec.begin(),h_vec.end());
thrust::for_each(h_vec.begin(),h_vec.end(), print_Functor());
However, if I try to do this with thrust::counting_iterator it fails:
thrust::counting_iterator<int> first(0);
thrust::counting_iterator<int> last = first+10;
for(thrust::counting_iterator<int> it=first;it!=last;it++)
printf("Value %d\n", *it);
printf("Launching for_each\n");
thrust::for_each(first,last,print_Functor());
What I get is that the for loop executes correctly, but the for_each fails with the error message:
after cudaFuncGetAttributes: unspecified launch failure
I tried to do this by making the iterator type a template argument:
thrust::for_each<thrust::counting_iterator<int>>(first,last, print_Functor());
but the same error results.
For completeness, I'm calling this from a MATLAB mex file (64 bit).
I've been able to get other thrust algorithms to work with the counting iterator (e.g. thrust::reduce gives the right result).
As a newcomer I'm probably doing something really stupid and missing something obvious - can anyone help?
Thanks for the comments so far; I have taken them on board. The worked example (outside Matlab) ran correctly and produced output, but when it was made into a mex file it still did not work: the first run produced no output at all, and the second run produced the same error message as before (only fixed by a recompile, after which it goes back to producing no output).
However there is a similar problem with it not executing the functor from thrust::for_each even under DOS. Here is a complete example:
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
struct sum_Functor {
int *sum;
sum_Functor(int *s){sum = s;}
__host__ __device__
void operator()(int i)
{
*sum+=i;
printf("In functor: i %d sum %d\n",i,*sum);
}
};
int main(){
thrust::counting_iterator<int> first(0);
thrust::counting_iterator<int> last = first+10;
int sum = 0;
sum_Functor sf(&sum);
printf("After constructor: value is %d\n", *(sf.sum));
for(int i=0;i<5;i++){
sf(i);
}
printf("Initiating for_each call - current value %d\n", (*(sf.sum)));
thrust::for_each(first,last,sf);
cudaDeviceSynchronize();
printf("After for_each: value is %d\n",*(sf.sum));
}
This is compiled under a DOS prompt with:
nvcc -o pf pf.cu
The output produced is:
After constructor: value is 0
In functor: i 0 sum 0
In functor: i 1 sum 1
In functor: i 2 sum 3
In functor: i 3 sum 6
In functor: i 4 sum 10
Initiating for_each call - current value 10
After for_each: value is 10
In other words the functor's overloaded operator() is called correctly from the for loop but is never called by the thrust::for_each algorithm. The only way to get the for_each to execute the functor when using the counting iterator is to omit the member variable.
( I should add that after years of using pure Matlab, my C++ is very rusty, so I could be missing something obvious ...)
In your comments you say that you want your code to be executed on the host side.
The error code "unspecified launch failure", and the fact that your functor is declared __host__ __device__, make me think Thrust is trying to execute on your device.
Can you add an execution policy to be sure where your code is executed?
Replace:
thrust::for_each(first,last,sf);
with
thrust::for_each(thrust::host, first,last,sf);
To be able to run on the GPU, your result must be allocated on device memory (through cudaMalloc) then copied back to host.
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <cstdio>   // printf
#include <cstdlib>  // atoi
struct sum_Functor {
int *sum;
sum_Functor(int *s){sum=s;}
__host__ __device__
void operator()(int i)
{
atomicAdd(sum, 1);
}
};
int main(int argc, char**argv){
thrust::counting_iterator<int> first(0);
thrust::counting_iterator<int> last = first+atoi(argv[1]);
int *d_sum;
int h_sum = 0;
cudaMalloc(&d_sum,sizeof(int));
cudaMemcpy(d_sum,&h_sum,sizeof(int),cudaMemcpyHostToDevice);
thrust::for_each(thrust::device,first,last,sum_Functor(d_sum));
cudaDeviceSynchronize();
cudaMemcpy(&h_sum,d_sum,sizeof(int),cudaMemcpyDeviceToHost);
printf("sum = %d\n", h_sum); // h_sum is a plain int, so it is printed directly
cudaFree(d_sum);
}
Code update: to get the correct result on your device you must use an atomic operation, because many device threads update the same sum concurrently.
I came across a Youtube video on c++11 concurrency (part 3) and the following code, which compiles and generates correct result in the video.
However, I get a compile error for this code using Visual Studio 2012. The compiler complains about the argument type of toSin(list<double>&&). If I change the argument type to list<double>&, the code compiles.
My question is: what is returned from move(list) in _tmain()? Is it an rvalue reference or just a reference?
#include "stdafx.h"
#include <iostream>
#include <thread>
#include <chrono>
#include <list>
#include <algorithm>
#include <cmath> // sin
using namespace std;
void toSin(list<double>&& list)
{
//this_thread::sleep_for(chrono::seconds(1));
for_each(list.begin(), list.end(), [](double & x)
{
x = sin(x);
});
for_each(list.begin(), list.end(), [](double & x)
{
int count = static_cast<int>(10*x+10.5);
for (int i=0; i<count; ++i)
{
cout.put('*');
}
cout << endl;
});
}
int _tmain(int argc, _TCHAR* argv[])
{
list<double> list;
const double pi = 3.1415926;
const double epsilon = 0.00000001;
for (double x = 0.0; x<2*pi+epsilon; x+=pi/16)
{
list.push_back(x);
}
thread th(&toSin, /*std::ref(list)*/std::move(list));
th.join();
return 0;
}
This appears to be a bug in MSVC2012. (and on quick inspection, MSVC2013 and MSVC2015)
thread does not use perfect forwarding directly, as storing a reference to data (temporary or not) in the originating thread and using it in the spawned thread would be extremely error prone and dangerous.
Instead, it copies each argument into internal storage whose type is decay_t of the argument's type.
The bug is that when it calls the worker function, it simply passes that internal copy to your procedure. Instead, it should move that internal data into the call.
This does not seem to be fixed in compiler version 19, which I think is MSVC2015 (did not double check), based off compiling your code over here
This is both due to the wording of the standard (it is supposed to invoke a decay_t<F> with decay_t<Ts>... -- which means rvalue binding, not lvalue binding), and because the local data stored in the thread will never be used again after the invocation of your procedure (so logically it should be treated as expiring data, not persistent data).
Here is a work around:
template<class F>
struct thread_rvalue_fix_wrapper {
F f;
template<class...Args>
auto operator()(Args&...args)
-> typename std::result_of<F(Args...)>::type
{
return std::move(f)( std::move(args)... );
}
};
template<class F>
thread_rvalue_fix_wrapper< typename std::decay<F>::type >
thread_rvalue_fix( F&& f ) { return {std::forward<F>(f)}; }
then
thread th(thread_rvalue_fix(&toSin), /*std::ref(list)*/std::move(list));
should work. (tested in MSVC2015 online compiler linked above) Based off personal experience, it should also work in MSVC2013. I don't know about MSVC2012.
What is returned from std::move is indeed an rvalue reference, but that doesn't matter because the thread constructor does not use perfect forwarding for its arguments. First it copies/moves them to storage owned by the new thread. Then, inside the new thread, the supplied function is called using the copies.
Since the copies are not temporary objects, this step won't bind to rvalue-reference parameters.
What the Standard says (30.3.1.2):
The new thread of execution executes
INVOKE( DECAY_COPY(std::forward<F>(f)), DECAY_COPY(std::forward<Args>(args))... )
with the calls to
DECAY_COPY being evaluated in the constructing thread.
and
In several places in this Clause the operation DECAY_COPY(x) is used. All such uses mean call the function decay_copy(x) and use the result, where decay_copy is defined as follows:
template <class T> decay_t<T> decay_copy(T&& v)
{ return std::forward<T>(v); }
The value category is lost.
I have:
vector of unique_ptrs of ObjectA
vector of newly default constructed vector of ObjectB, and
a function in Object B that has signature void f(unique_ptr<ObjectA> o).
(word Object omitted from here on)
How do I do Bvec[i].f(Avec[i]) for all 0 < i < length in parallel?
I have tried using transform(Bvec.begin(), Bvec.end(), A.begin(), B.begin(), mem_fun_ref(&B::f)), but it gives a bunch of errors and I'm not sure if it would even pass the right A as parameter, let alone allow me to move them. (&B::f(A.begin()) would not work as the last parameter either.)
I have also thought of using for_each and then a lambda function, but not sure how to get the corresponding element. I thought of incrementing a counter, but then I don't think that parallelizes well (I could be wrong).
I can, of course, use a for loop from 0 to end, but I am pretty sure there is a simple thing I'm missing, and it is not parallel with a simple for loop.
Thanks.
Here is a non-parallel implementation using a handmade algorithm. I'm sure someone more versed in the functional style could come up with a more elegant solution. The problem with transform is that we cannot use it with functions that return void, and I can't remember another stdlib function that takes two ranges and applies them to each other. If you really want to parallelize this, it needs to be done in the apply_to function. Launching an async task (e.g. std::async(*begin++, *begin2++)) could work, although I have no experience with this and cannot get it to work on gcc 4.6.2.
#include <iterator>
#include <memory>
#include <vector>
#include <algorithm>
#include <functional>
// this is very naive: it should dispatch to different versions
// depending on the value_type of iterator2, especially considering
// that a tuple would make sense
template<typename InputIterator1, typename InputIterator2>
void apply_to(InputIterator1 begin, InputIterator1 end, InputIterator2 begin2) {
while(begin != end) {
(*begin++)(*begin2++);
}
}
struct Foo {
};
struct Bar {
void f(std::unique_ptr<Foo>) { }
};
int main()
{
std::vector< std::unique_ptr<Foo> > foos(10);
std::vector< Bar > bars(10);
std::vector< std::function<void(std::unique_ptr<Foo>) > > funs;
std::transform(bars.begin(), bars.end(), std::back_inserter(funs),
// non-const due to f non-const, change accordingly
[](Bar& b) { return std::bind(&Bar::f, &b, std::placeholders::_1); });
// now we just need to apply each element in foos with funs
apply_to(funs.begin(), funs.end(), std::make_move_iterator(foos.begin()));
return 0;
}
So I was trying to test a lambda accessing local variables in the scope in which it is used, based roughly on a simple example by Bjarne on the C++0x FAQS page at:
http://www2.research.att.com/~bs/C++0xFAQ.html#lambda
When I try this simple test code:
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
//Test std::fill() with C++0x lambda and local var
void f (int v) {
vector<int> indices(v);
int count = 0;
fill(indices.begin(), indices.end(), [&count]() {
return ++count;
});
//output test indices
for (auto x : indices) {
cout << x << endl;
}
}
int main() {
f(50);
}
I get the error:
required from 'void std::fill(_ForwardIterator, _ForwardIterator, const _Tp&) [with _ForwardIterator = __gnu_cxx::__normal_iterator<int*, std::vector<int> >, _Tp = f(int)::<lambda()>]'
I'm supposing this errmsg indicates the std::fill() signature requires a const Type& to use for the new value element assignment.
But if I'm to be able to use the fill() for this purpose, as indicated by Bjarne's example, won't I need to use a reference '[&count]' inside the lambda capture clause to be able to reassign the original indices element value with the incrementing count var via the 'return ++count;' lambda statement block?
I admit I don't quite understand all about these lambdas just yet! :)
Bjarne's example doesn't compile. It can't compile, unless std::fill was defined differently in C++0x. Maybe it came from a conceptized version of std::fill that could take a function, but the actual version (according to section 25.1 of N3242) takes an object, not a function, and copies that object into every element of the range. That is exactly what your code asks for: it tries to assign the lambda itself to each int, which fails.
The function you're looking for is std::generate.
Try this:
for_each(indices.begin(), indices.end(), [&count](int& it)
{
it = ++count;
});
Here, it is the current element of the vector, passed to the lambda by reference.
I hope it's OK to add an "update" style answer, for the benefit of any future readers who may have this same question. Please let me know since I'm new here.
So, here's my final reworked form of the code that does what I'm wanting:
#include <iostream>
#include <vector>
#include <algorithm>
//Overwrite a vector<int> with incrementing values, base-n.
void init_integers(std::vector<int>& ints, int base) {
int index{ base };
std::generate(ints.begin(), ints.end(), [&index]() {
return index++; //post-incr.
});
}
//Default wrapper to overwrite a vector<int>
// with incrementing values, base-0.
void init_integers(std::vector<int>& ints) {
init_integers(ints, 0);
}
//Test lambda-based vector<int> initialization.
int main() {
std::vector<int> indices( 50 );
init_integers(indices);
//test output loaded indices.
for (auto x : indices) {
std::cout << x << std::endl;
}
}
Thanks for the helpful answers, I find this a much easier approach. I'll very likely be using lambdas from now on for algorithms that take a function object!
Update 2:
Based on ildjarn's comment to the original post above:
"Note that the exact functionality here is implemented by a new C++0x algorithm -- std::iota."
After testing, I've modified the appropriate code to:
...
#include <numeric>
//Overwrite a vector<int> with incrementing values, base-n.
void init_integers(std::vector<int>& ints, int base) {
std::iota(ints.begin(), ints.end(), base);
}
...
and it's working fine. ("Iota", s26.7.6, of N3242).
The simpler and cleaner the code (though a bit obscure), the easier it is to read and, more importantly, to maintain.
Thanks ildjarn! (Though it was a good exercise personally to go through this process to pick up some further insight on the C++0x lambdas!) :)
-Bud Alverson