CUDA thrust::for_each with thrust::counting_iterator - C++

I'm a bit of a newcomer to CUDA and thrust. I seem to be unable to get the thrust::for_each algorithm to work when supplied with a counting_iterator.
Here is my simple functor:
struct print_Functor {
    print_Functor() {}

    __host__ __device__
    void operator()(int i)
    {
        printf("index %d\n", i);
    }
};
Now if I call this with a host-vector prefilled with a sequence, it works fine:
thrust::host_vector<int> h_vec(10);
thrust::sequence(h_vec.begin(), h_vec.end());
thrust::for_each(h_vec.begin(), h_vec.end(), print_Functor());
However, if I try to do this with thrust::counting_iterator it fails:
thrust::counting_iterator<int> first(0);
thrust::counting_iterator<int> last = first + 10;
for (thrust::counting_iterator<int> it = first; it != last; it++)
    printf("Value %d\n", *it);
printf("Launching for_each\n");
thrust::for_each(first, last, print_Functor());
What I get is that the for loop executes correctly, but the for_each fails with the error message:
after cudaFuncGetAttributes: unspecified launch failure
I tried to do this by making the iterator type a template argument:
thrust::for_each<thrust::counting_iterator<int>>(first, last, print_Functor());
but the same error results.
For completeness, I'm calling this from a MATLAB mex file (64 bit).
I've been able to get other thrust algorithms to work with the counting iterator (e.g. thrust::reduce gives the right result).
As a newcomer I'm probably doing something really stupid and missing something obvious - can anyone help?
Thanks for the comments so far; I have taken them on board. The worked example (outside MATLAB) ran correctly and produced output, but when built as a mex file it still did not work - the first time producing no output at all, and the second time just producing the same error message as before (only fixed by a recompile, after which it goes back to producing no output).
However, there is a similar problem with thrust::for_each not executing the functor even in a standalone program run from a DOS prompt. Here is a complete example:
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <cstdio>

struct sum_Functor {
    int *sum;
    sum_Functor(int *s){ sum = s; }

    __host__ __device__
    void operator()(int i)
    {
        *sum += i;
        printf("In functor: i %d sum %d\n", i, *sum);
    }
};

int main(){
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + 10;

    int sum = 0;
    sum_Functor sf(&sum);

    printf("After constructor: value is %d\n", *(sf.sum));
    for(int i = 0; i < 5; i++){
        sf(i);
    }
    printf("Initiating for_each call - current value %d\n", *(sf.sum));

    thrust::for_each(first, last, sf);
    cudaDeviceSynchronize();
    printf("After for_each: value is %d\n", *(sf.sum));
}
This is compiled under a DOS prompt with:
nvcc -o pf pf.cu
The output produced is:
After constructor: value is 0
In functor: i 0 sum 0
In functor: i 1 sum 1
In functor: i 2 sum 3
In functor: i 3 sum 6
In functor: i 4 sum 10
Initiating for_each call - current value 10
After for_each: value is 10
In other words, the functor's overloaded operator() is called correctly from the for loop but is never called by the thrust::for_each algorithm. The only way to get the for_each to execute the functor when using the counting iterator is to omit the member variable.
(I should add that after years of using pure MATLAB, my C++ is very rusty, so I could be missing something obvious...)

In the comments you say that you want your code to be executed on the host side.
The "unspecified launch failure" error, and the fact that your functor is declared __host__ __device__, make me think Thrust is trying to execute it on your device.
Can you add an execution policy to be sure where your code is executed?
replace:
thrust::for_each(first, last, sf);
with:
thrust::for_each(thrust::host, first, last, sf);
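For instance, here is a minimal self-contained host-side sketch (reusing the sum_Functor from the question, with the printf dropped); with thrust::host everything runs on the CPU, so writing through the plain host pointer is fine:

#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <cstdio>

struct sum_Functor {
    int *sum;
    sum_Functor(int *s){ sum = s; }

    __host__ __device__
    void operator()(int i) { *sum += i; }
};

int main(){
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + 10;
    int sum = 0;
    // thrust::host keeps execution on the CPU, so &sum stays valid
    thrust::for_each(thrust::host, first, last, sum_Functor(&sum));
    printf("sum = %d\n", sum);  // expected: 45
}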
To be able to run on the GPU, your result must be allocated in device memory (through cudaMalloc) and then copied back to the host.
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <cstdio>
#include <cstdlib>

struct sum_Functor {
    int *sum;
    sum_Functor(int *s){ sum = s; }

    __host__ __device__
    void operator()(int i)
    {
        atomicAdd(sum, i); // atomic: many device threads update *sum concurrently
    }
};

int main(int argc, char **argv){
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + atoi(argv[1]);

    int *d_sum;
    int h_sum = 0;
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_sum, &h_sum, sizeof(int), cudaMemcpyHostToDevice);

    thrust::for_each(thrust::device, first, last, sum_Functor(d_sum));
    cudaDeviceSynchronize();

    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_sum);
    cudaFree(d_sum);
}
Code update: to get a correct result on the device you must use an atomic operation, because thrust::for_each runs the functor concurrently across many GPU threads, and a plain *sum += i is a data race.
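As an aside, since you note that thrust::reduce already works for you with counting iterators, a reduction like this can avoid manual atomics entirely; a minimal sketch:

#include <thrust/reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <cstdio>

int main(){
    thrust::counting_iterator<int> first(0);
    thrust::counting_iterator<int> last = first + 10;
    // thrust performs the parallel accumulation internally; no atomics needed
    int sum = thrust::reduce(first, last, 0);
    printf("sum = %d\n", sum);  // 45
}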

Related

Output parameters with arrays in one function with Arduino and C++

I have a problem initializing my function with a variable number of parameters.
It works if I first create an array, int params[] = {...}. However, it doesn't work if I want to write the parameters directly into the function call.
declaration (in the .h)
void phase_an(int led[]);
in the .cpp
void RS_Schaltung::phase_an(int led[])
{
    for (size_t i = 0; i < LEN(led); i++) {
        digitalWrite(led[i], HIGH);
    }
}
If I try it this way, it won't work. This is how I would like it to be, but I couldn't find anything about it on the internet:
in the Arduino sketch:
RS.phase_an(RS.ampelRot, RS.ampelGelb, ..... ); <--- is there a way to do it like that?
What puzzles me is that it does work this way:
int p_an [5] = {RS.ampelRot, RS.ampelGelb, RS.ampelGruen, RS.rot, RS.gelb};
...................
RS.phase_an (p_an);
Does anyone have a suggestion?
There are several ways of making a function accept a variable number of arguments here.
However, in your current code there is a problem: when you pass a native array of unknown size as an argument of a function (e.g. void f(int a[])), the argument is managed as a pointer to the array's first element, and there is no way inside the function to know the real length of that array. I don't know how LEN() is defined, but chances are high that it doesn't work correctly in your code.
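A minimal illustration of that decay (assuming LEN is the usual sizeof(x)/sizeof(*x) macro, which is a guess):

#include <iostream>

void f(int a[]) {                      // really: void f(int *a)
    std::cout << sizeof(a) << '\n';    // size of a pointer (e.g. 8), not of the array
}

int main() {
    int arr[5] = {1, 2, 3, 4, 5};
    std::cout << sizeof(arr) << '\n';  // 5 * sizeof(int): the whole array
    f(arr);                            // the length information is lost past this point
}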
A safer and more practical alternative is to use a vector<int> instead:
#include <iostream>
#include <vector>

using namespace std;

void f(const vector<int>& a){
    for (size_t i = 0; i < a.size(); i++) {
        cout << a[i] << " ";
    }
    cout << endl;
}

int main() {
    vector<int> test = {1, 2, 3, 4};
    f(test);
    f({1, 2, 3, 4});
    return 0;
}
In this case, you can pass your multiple values between braces in the function call (e.g. f({RS.ampelRot, RS.ampelGelb, RS.ampelGruen, RS.rot, RS.gelb})) and C++ will automatically convert them to a vector.
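If your board's toolchain provides <initializer_list> (ARM- and ESP-based cores generally do; classic AVR cores may not ship the STL at all), a hedged sketch of a variable-argument phase_an without std::vector:

#include <initializer_list>

// hypothetical reworking: accepts any number of pin numbers
void RS_Schaltung::phase_an(std::initializer_list<int> leds)
{
    for (int pin : leds) {
        digitalWrite(pin, HIGH);
    }
}

Call it with braces in the sketch: RS.phase_an({RS.ampelRot, RS.ampelGelb, RS.ampelGruen, RS.rot, RS.gelb});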

Thread Constructor Initialization C++

I have been attempting to write a simple program to experiment with vectors of threads. At the moment I am just trying to create a thread, but I keep running into an error: there is no matching constructor for std::thread matching the argument list. Here is what I have done:
#include <functional>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int sum = 0;

void thread_sum (auto it, auto it2, auto init) {
    sum = std::accumulate(it, it2, init);
}

int main() {
    // * Non Multi-Threaded
    // We're going to sum up a bunch of numbers.
    std::vector<int> toBeSummed;
    for (int i = 0; i < 30000; ++i) {
        toBeSummed.push_back(1);
    }

    // Initialize a sum variable
    long sum = std::accumulate(toBeSummed.begin(), toBeSummed.end(), 0);
    std::cout << "The sum was " << sum << std::endl;

    // * Multi Threaded
    // Create threads
    std::vector<std::thread> threads;
    std::thread t1(&thread_sum, toBeSummed.begin(), toBeSummed.end(), 0);
    std::thread t2(&thread_sum, toBeSummed.begin(), toBeSummed.end(), 0);
    threads.push_back(std::move(t1));
    threads.push_back(std::move(t2));
    return 0;
}
The line that messes up is the following:
auto t1 =
std::thread {std::accumulate, std::ref(toBeSummed.begin()),
It is an issue with the constructor. I have tried different combinations of std::ref, std::function, and other wrappers, and tried making my own function lambda object as a wrapper for accumulate.
Here is some additional information:
The error message is: atomics.cpp:28:7: error: no matching constructor for initialization of 'std::thread'
Moreover, when hovering over the constructor, it tells me that the first parameter is of <unknown_type>.
Other attempts I have tried:
Using references instead of regular value parameters
Using std::bind
Using std::function
Declaring the function in a variable and passing that as my first parameter to the constructor
Compiling with different flags, like std=c++2a
EDIT:
I will leave the original issue as a means for others to learn from my mistakes. As the accepted answer shows, this is due to my excessive usage of auto. I had read a C++ book that basically said "always use auto, it's much more readable! Like Python and dynamic typing, but with the performance of C++," yet clearly this cannot always be done. The using keyword provides the readability while still keeping the safety. Thank you for the answers!
The problems you're encountering are because std::accumulate is an overloaded function template, so the compiler doesn't know what specific function type to treat it as when passed as an argument to the thread constructor. Similar problems arise with your thread_sum function because of the auto parameters.
You can choose a specific overload/instantiation of std::accumulate as follows:
std::thread t2(
    (int (*)(decltype(toBeSummed.begin()), decltype(toBeSummed.end()), int)) std::accumulate,
    toBeSummed.begin(), toBeSummed.end(), 0);
The problem is your excessive use of auto. You can fix it by changing this one line:
void thread_sum (auto it, auto it2, auto init) {
To this:
using Iter = std::vector<int>::const_iterator;
void thread_sum (Iter it, Iter it2, int init) {
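Putting the pieces together, here is a minimal sketch of the corrected program (my own illustrative variant, not the question's exact code: the range is split so the two threads do disjoint halves, and per-thread results replace the shared global to avoid a data race):

#include <functional>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

using Iter = std::vector<int>::const_iterator;

// A plain (non-template) function: &thread_sum now has a single concrete
// type, so std::thread's constructor can deduce its arguments.
void thread_sum(Iter it, Iter it2, long& out) {
    out = std::accumulate(it, it2, 0L);
}

int main() {
    std::vector<int> toBeSummed(30000, 1);
    Iter mid = toBeSummed.begin() + toBeSummed.size() / 2;

    long sum1 = 0, sum2 = 0;
    std::thread t1(&thread_sum, toBeSummed.begin(), mid, std::ref(sum1));
    std::thread t2(&thread_sum, mid, toBeSummed.end(), std::ref(sum2));
    t1.join();
    t2.join();
    std::cout << "The sum was " << (sum1 + sum2) << std::endl;
}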

CUDA C++11, array of lambdas, function by index, not working

I am having trouble trying to make a CUDA program manage an array of lambdas by their index. An example code that reproduces the problem
#include <cuda.h>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <cassert>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

template<typename Lambda>
__global__ void kernel(Lambda f){
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    printf("device: thread %i: ", t);
    printf("f() = %i\n", f() );
}

int main(int argc, char **argv){
    // arguments
    if(argc != 2){
        fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
        exit(EXIT_FAILURE);
    }
    int i = atoi(argv[1]);

    // lambdas
    auto lam0 = [] __host__ __device__ (){ return 333; };
    auto lam1 = [] __host__ __device__ (){ return 777; };

    // make vector of functions
    std::vector<int(*)()> v;
    v.push_back(lam0);
    v.push_back(lam1);

    // host: calling a function by index
    printf("host: f() = %i\n", (*v[i])() );

    // device: calling a function by index
    kernel<<< 1, 1 >>>( v[i] ); // does not work
    //kernel<<< 1, 1 >>>( lam0 ); // does work
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return EXIT_SUCCESS;
}
Compiling with
nvcc -arch sm_60 -std=c++11 --expt-extended-lambda main.cu -o prog
The error I get when running is
➜ cuda-lambda ./prog 0
host: f() = 333
device: GPUassert: invalid program counter main.cu 53
It seems that CUDA cannot manage the int(*)() function pointer form (while host C++ does work properly). On the other hand, each lambda is managed as a different data type, even when they are identical in code and have the same signature. Then, how can we achieve function by index in CUDA?
There are a few considerations here.
Although you suggest wanting to "manage an array of lambdas", you are actually relying on the graceful conversion of a lambda to a function pointer (possible when the lambda does not capture).
When you mark something as __host__ __device__, you are declaring to the compiler that two copies of said item need to be compiled (with two obviously different entry points): one for the CPU, and one for the GPU.
When we take a __host__ __device__ lambda and ask it to decay to a function pointer, we are then left with the question "which function pointer (entry point) to choose?" The compiler no longer has the option of carrying the lambda object around; it must choose one or the other (host or device, CPU or GPU) for your vector. Whichever one it chooses, the vector could (will) break if used in the wrong environment.
One takeaway from this is that your two test cases are not the same. In one case (broken) you are passing a function pointer to the kernel (so the kernel is templated to accept a function pointer argument) and in the other case (working) you are passing a lambda to the kernel (so the kernel is templated to accept a lambda argument).
The problem here, in my view, is not simply arising out of use of a container, but arising out of the type of container you are using. I can demonstrate this in a simple way (see below) by converting your vector to a vector of actual lambda type. In that case, we can make the code "work" (sort of), but since every lambda has a unique type, this is an uninteresting demonstration. We can create a multi-element vector, but the only element we can store in it is one of your two lambdas (not both at the same time).
If we use a container that can handle dissimilar types (e.g. std::tuple), perhaps we can make some progress here, but I know of no direct method to index through the elements of such a container. Even if we could, the template kernel accepting lambda as argument/template type would have to be instantiated for each lambda.
In my view, function pointers avoid this particular type "messiness".
Therefore, as an answer to this question:
Then, how can we achieve function by index in CUDA?
I would suggest for the time being that function by index in host code be separated (e.g. two separate containers) from function by index in device code, and for function by index in device code, you use any of the techniques (which don't use or depend on lambdas) covered in other questions, such as this one.
Here is a worked example (I think) demonstrating the note above, that we can create a vector of lambda "type", and use the resultant element(s) from that vector as lambdas in both host and device code:
$ cat t64.cu
#include <cuda.h>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <cassert>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

template<typename Lambda>
__global__ void kernel(Lambda f){
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    printf("device: thread %i: ", t);
    printf("f() = %i\n", f() );
}

template <typename T>
std::vector<T> fill(T L0, T L1){
    std::vector<T> v;
    v.push_back(L0);
    v.push_back(L1);
    return v;
}

int main(int argc, char **argv){
    // arguments
    if(argc != 2){
        fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
        exit(EXIT_FAILURE);
    }
    int i = atoi(argv[1]);

    // lambdas
    auto lam0 = [] __host__ __device__ (){ return 333; };
    auto lam1 = [] __host__ __device__ (){ return 777; };

    // vector of lambda "type": every lambda has a unique type, so both
    // elements have to be copies of the same lambda
    auto v = fill(lam0, lam0);

    // previous, broken approach:
    // std::vector<int(*)()> v;
    // v.push_back(lam0);
    // v.push_back(lam1);

    // host: calling a function by index
    printf("host: f() = %i\n", (*v[i])() );

    // device: calling a function by index
    kernel<<< 1, 1 >>>( v[i] ); // works now: the kernel is instantiated for the lambda type
    //kernel<<< 1, 1 >>>( lam0 );
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return EXIT_SUCCESS;
}
$ nvcc -arch sm_61 -std=c++11 --expt-extended-lambda t64.cu -o t64
$ cuda-memcheck ./t64 0
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck ./t64 1
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$
As mentioned above, this code is not sensible code; it is presented to demonstrate a particular point.
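For reference, here is a minimal sketch of the function-pointer-table technique alluded to above (illustrative names, not code from the question; the __device__ function-pointer array is statically initialized in device code, where the addresses are meaningful):

#include <cstdio>
#include <cstdlib>

__device__ int f0() { return 333; }
__device__ int f1() { return 777; }

// table of device function pointers, initialized in device code
typedef int (*fptr_t)();
__device__ fptr_t d_table[2] = { f0, f1 };

__global__ void kernel(int i) {
    // index into the table on the device, where the pointers are valid
    printf("device: f() = %d\n", d_table[i]());
}

int main(int argc, char **argv) {
    int i = (argc > 1) ? atoi(argv[1]) : 0;  // caller promises i is 0 or 1
    kernel<<<1, 1>>>(i);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}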

CUDA curand "An illegal memory access was encountered"

I've been spending a lot of time trying to figure out the cause of this problem. The following code attempts to generate a sequence of normally distributed random variables using curand on the device. It seems to generate a few successfully, but then crashes with an "an illegal memory access was encountered" error. Any help is much appreciated.
main.cu
#include <stdio.h>
#include <cuda.h>
#include <curand_kernel.h>

class A {
public:
    __device__ A(const size_t& seed) {
        printf("\nA()");
        curandState state;
        curand_init(seed, 0, 0, &state);
        for(size_t i = 0; i < 1000; ++i)
            printf("\n%f", curand_normal(&state));
    }
    __device__ ~A() { printf("\n~A()"); }
};

/// Kernel
__global__ void kernel(const size_t& seed) {
    printf("\nHello from Kernel...");
    A a(seed);
    return;
}

int main(void) {
    kernel<<<1,1>>>(1);
    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != cudaSuccess)
        printf("kernel launch failed with error \"%s\".\n",
               cudaGetErrorString(cudaerr));
    return 0;
}
Output
Hello from Kernel...
A()
0.292537
-0.718359
0.958011
0.633711kernel launch failed with error "an illegal memory access was encountered".
I have run this both on my machine (CUDA 7.0) and on a supercomputing cluster (CUDA 6.5), with the same result.
Get rid of the pass-by-reference on the kernel parameter (&).
You are not allowed to write GPU kernels that have pass-by-reference parameters. A GPU kernel cannot modify a host variable. (ignoring Unified Memory, Zero-Copy, and related mechanisms which are not at issue here.)
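A minimal sketch of the fix - the launch then copies the seed value to the device instead of handing the kernel a reference to host memory:

__global__ void kernel(const size_t seed) { // no '&': seed is passed by value
    printf("\nHello from Kernel...");
    A a(seed);
}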

C++ "Could not deduce template argument" when using std::async

I'm quite new to C++ and programming in general. To practise, I made a sorting algorithm similar to mergesort. Then I tried to make it multi-threaded.
std::future<T*> first = std::async(std::launch::async, &mergesort, temp1, temp1size);
std::future<T*> second = std::async(std::launch::async, &mergesort, temp2, temp2size);
temp1 = first.get();
temp2 = second.get();
But it seems my compiler can't decide which template to use as I get the same error twice.
Error 1 error C2783: 'std::future<result_of<enable_if<std::_Is_launch_type<_Fty>::value,_Fty>::type(_ArgTypes...)>::type> std::async(_Policy_type,_Fty &&,_ArgTypes &&...)' : could not deduce template argument for '_Fty'
Error 2 error C2784: 'std::future<result_of<enable_if<!std::_Is_launch_type<decay<_Ty>::type>::value,_Fty>::type(_ArgTypes...)>::type> std::async(_Fty &&,_ArgTypes &&...)' : could not deduce template argument for '_Fty &&' from 'std::launch'
The errors lead me to believe that std::async is overloaded with two different templates, one for a specified launch policy and one for an unspecified one, and that the compiler fails to select the correct one (I'm using Visual Studio Express 2013). So how do I point the compiler at the appropriate template? (Doing std::future<T*> second = std::async<std::launch::async>(&mergesort, temp2, temp2size); doesn't seem to work - I get "invalid template argument, type expected".) And is there a better way to do this altogether?
Thanks!
You need to specify the template parameter for mergesort; std::async isn't going to be smart enough to figure it out on its own.
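For instance, keeping the question's pointer-based signature - assumed here to be template <typename T> T* mergesort(T* data, size_t size), a guess at code not shown - naming the specialization explicitly gives the thread constructor a concrete function type:

// assumed declaration: template <typename T> T* mergesort(T* data, size_t size);
std::future<T*> first  = std::async(std::launch::async, &mergesort<T>, temp1, temp1size);
std::future<T*> second = std::async(std::launch::async, &mergesort<T>, temp2, temp2size);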
An iterator-based example appears below. It also utilizes the current thread as a recursion point rather than burning a thread handle waiting on two other threads. I warn you, there are better ways to do this, but tuning this may suffice for your needs.
#include <iostream>
#include <algorithm>
#include <vector>
#include <thread>
#include <future>
#include <random>
#include <atomic>

static std::atomic_uint_fast64_t n_threads = ATOMIC_VAR_INIT(0);

template<typename Iter>
void mergesort(Iter begin, Iter end)
{
    auto len = std::distance(begin, end);
    if (len <= 16*1024) // 16K segments defer to std::sort
    {
        std::sort(begin, end);
        return;
    }
    Iter mid = std::next(begin, len/2);

    // start lower partition async
    auto ft = std::async(std::launch::async, mergesort<Iter>, begin, mid);
    ++n_threads;

    // use this thread for the high partition
    mergesort(mid, end);

    // wait on results, then merge in-place
    ft.wait();
    std::inplace_merge(begin, mid, end);
}

int main()
{
    std::random_device rd;
    std::mt19937 rng(rd());
    std::uniform_int_distribution<> dist(1, 100);

    std::vector<int> data;
    data.reserve(1024*1024*16);
    std::generate_n(std::back_inserter(data), data.capacity(),
                    [&](){ return dist(rng); });

    mergesort(data.begin(), data.end());
    std::cout << "threads: " << n_threads << '\n';
}
Output
threads: 1023
You'll have to trust me that the resulting vector is sorted; I'm not going to dump 16MB of values into this answer.
Notes: This was compiled and tested using clang 3.3 on a Mac and ran without issue. My gcc 4.7.2 is unfortunately brain-dead, as it tosses cookies in a shared-count abort, but I don't have high confidence in the libstdc++ or the VM on which it is housed.