Calling a thrust function inside a CUDA __global__ kernel - c++

I've read that dynamic parallelism is supported in newer versions of CUDA, and that I can call thrust functions like thrust::exclusive_scan inside a kernel function with the thrust::device parameter.
__global__ void kernel(int* inarray, int n, int *result) {
    extern __shared__ int s[];
    int t = threadIdx.x;
    s[t] = inarray[t];
    __syncthreads();
    thrust::exclusive_scan(thrust::device, s, s + n, result);
    __syncthreads();
}

int main() {
    // prep work
    kernel<<<1, n, n * sizeof(int)>>>(inarray, n, result);
}
What confuses me is:
When calling a thrust function inside a kernel, does each thread call the function once, so that they all launch dynamic parallelism on the data?
If they do, I only need one thread to call thrust, so I could just guard the call with a check on threadIdx; if not, how do the threads in a block communicate with each other that the call to thrust has already been made and they should just ignore it (this seems unlikely, since there would be no systematic way to ensure that from user code)? To summarize, what exactly happens when I call a thrust function with the thrust::device parameter inside a kernel?

Every thread in your kernel that executes the thrust algorithm will execute a separate copy of your algorithm. The threads in your kernel do not cooperate on a single algorithm call.
If you have met all the requirements (HW/SW and compilation settings) for a CUDA dynamic parallelism (CDP) call, then each thread that encounters the thrust algorithm call will launch a CDP child kernel to perform the thrust algorithm (in that case, the threads in the CDP child kernel do cooperate). If not, each thread that encounters the thrust algorithm call will perform it as if you had specified thrust::seq instead of thrust::device.
If you prefer to avoid the CDP activity in an otherwise CDP-capable environment, you can specify thrust::seq instead.
If you intend, for example, that only a single copy of your thrust algorithm be executed, it will be necessary in your kernel code to ensure that only one thread calls it, for example:
if (!threadIdx.x) thrust::exclusive_scan(...
or similar.
Questions around synchronization before/after the call are no different from ordinary CUDA code. If you need all threads in the block to wait for the thrust algorithm to complete, use e.g. __syncthreads() (and cudaDeviceSynchronize() in the CDP case).
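Putting those pieces together, here is a minimal sketch of the single-copy case (using thrust::seq to sidestep the CDP requirements entirely, and assuming n does not exceed the block size):
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

__global__ void kernel(int *inarray, int n, int *result) {
    extern __shared__ int s[];
    int t = threadIdx.x;
    if (t < n) s[t] = inarray[t];
    __syncthreads();
    if (t == 0)
        thrust::exclusive_scan(thrust::seq, s, s + n, result);  // one copy, run sequentially by thread 0
    __syncthreads();  // the rest of the block waits here until the scan has finished
}
With thrust::device instead of thrust::seq, the same guard still ensures that only one copy of the algorithm runs; the difference is only in how that single copy gets executed.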
The information here may possibly be of interest as well.

Related

Is it necessary to use synchronization between two calls to CUDA kernels?

So far I have written programs where a kernel is called only once in the program.
So I have a kernel:
__global__ void someKernel(float *d_in){ //Any parameters
    //some operation
}
and I basically do
main()
{
    //create an array in device memory
    cudaMalloc(......);
    //move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    //call the kernel
    someKernel<<<nblocks,512>>>(.......);
    //copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // Point to notice HERE
}
It works fine. However, this time I want to call the kernel not just once but many times.
Something like:
main()
{
    //create an array in device memory
    cudaMalloc(......);
    //move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    //call the kernel
    someKernel<<<nblocks,512>>>(.......);
    //copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // From here
    //Some unrelated calculations here
    dothis();
    dothat();
    //Then again the kernel repeatedly
    for(k: some_ks)
    {
        // Do some pre-calculations
        //call the kernel
        someKernel<<<nblocks,512>>>(.......);
        // some post calculations
    }
}
My question is: should I use some kind of synchronization between the first kernel call and the kernel calls in the for loop (and between iterations)?
Perhaps cudaDeviceSynchronize or something else, or is it not necessary?
Additional synchronization would not be necessary in this case for at least 2 reasons.
cudaMemcpy is a synchronizing call already. It blocks the CPU thread and waits until all previous CUDA activity issued to that device is complete, before it allows the data transfer to begin. Once the data transfer is complete, the CPU thread is allowed to proceed.
CUDA activity issued to a single device will not overlap in any way unless you use (non-default) CUDA streams. You are not using streams, so even asynchronous work issued to the device will execute in issue order: items A and B issued to the device in that order will not overlap with each other; item A will complete before item B is allowed to begin. This is a basic point of CUDA stream semantics.
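To make that concrete, here is a minimal sketch of the pattern in the question (identifiers such as h_data, d_data, and the iteration count are placeholders, not taken from the original code). Every operation goes to the default stream, so each one waits for the previous one and no extra synchronization call is needed:
#include <cuda_runtime.h>

__global__ void someKernel(float *d_in) { /* some operation */ }

int main() {
    const int n = 1 << 20, nblocks = (n + 511) / 512;
    float *h_data = new float[n];
    for (int i = 0; i < n; i++) h_data[i] = 0.0f;
    float *d_data = NULL;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);   // blocking
    someKernel<<<nblocks, 512>>>(d_data);                                    // asynchronous, but queued in order
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);   // waits for the kernel, then copies

    for (int k = 0; k < 10; k++) {
        // host-side pre-calculations ...
        someKernel<<<nblocks, 512>>>(d_data);   // still ordered after all prior default-stream work
        // only needed if the post-calculations read the results on the host:
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}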

Are functions in CUDA thrust library synchronized implicitly?

I've run into some problems when using functions from the thrust library, and I am not sure if I should add cudaDeviceSynchronize manually before them. For example,
double dt = 0;
kernel_1<<<blocks, threads>>>(it);
dt = *(thrust::max_element(it, it + 10));
printf("%f\n", dt);
Since kernel_1 is non-blocking, the host will execute the next statement. The problem is that I am not sure whether thrust::max_element is blocking. If it is, this works well; otherwise, will the host skip it and execute the printf statement?
Thanks
Your code is broken in at least 2 ways.
it is presumably a device pointer:
kernel_1<<<blocks, threads>>>(it);
^^
It is not allowed to use a raw device pointer as an argument to a thrust algorithm:
dt = *(thrust::max_element(it, it + 10));
^^
unless you wrap that pointer in a thrust::device_ptr or else use the thrust::device execution policy explicitly as an argument to the algorithm. Without any of these clues, thrust will dispatch the host code path (which will probably seg fault) as discussed in the thrust quick start guide.
If you fix the above item using either thrust::device_ptr or thrust::device, then thrust::max_element will return an iterator of a type consistent with the iterators passed to it: if you pass a thrust::device_ptr it will return a thrust::device_ptr, and if you use thrust::device with your raw pointer it will return a raw pointer. Dereferencing a thrust::device_ptr in host code is legal (thrust performs the device-to-host copy for you), but dereferencing a raw device pointer in host code is not:
dt = *(thrust::max_element(it, it + 10));
     ^
Again, I would expect such usage to seg fault.
Regarding asynchrony, it is safe to assume that all thrust algorithms that return a scalar quantity stored in a stack variable are synchronous. That means the CPU thread will not proceed beyond the thrust call until the stack variable has been populated with the correct value.
Regarding GPU activity in general, unless you use streams, all GPU activity is issued to the same (default) stream. This means that all CUDA activity will be executed in order, and a given CUDA operation will not begin until the preceding CUDA activity is complete. Therefore, even though your kernel launch is asynchronous and the CPU thread will proceed onto the thrust::max_element call, any CUDA activity spawned from that call will not begin executing until the previous kernel launch is complete. Therefore, any changes made by kernel_1 to the data referenced by it should be finished and completely valid before any CUDA processing in thrust::max_element begins. And as we've seen, thrust::max_element itself will insert synchronization.
So once you fix the defects in your code, there should not be any requirement to insert additional synchronization anywhere.
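As an illustration only, here is one way the snippet could be repaired using thrust::device_ptr (a sketch: the body of kernel_1 and the element type are made up, since the question does not show them):
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>
#include <cstdio>

__global__ void kernel_1(double *it) {   // stand-in for the real kernel
    int i = threadIdx.x;
    if (i < 10) it[i] = i * 1.5;
}

int main() {
    double *it = NULL;
    cudaMalloc(&it, 10 * sizeof(double));
    kernel_1<<<1, 10>>>(it);                               // asynchronous launch
    thrust::device_ptr<double> dptr(it);
    double dt = *(thrust::max_element(dptr, dptr + 10));   // blocking, and ordered after kernel_1
    printf("%f\n", dt);
    cudaFree(it);
    return 0;
}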
This function does not seem to be asynchronous.
Both of these pages explain the behaviour of max_element() and neither describes it as asynchronous, so I would assume it is blocking:
thrust : Extrema
max_element
Since it iterates over all the elements to find the maximum of the values, I cannot see how it could be asynchronous.
You can still use cudaDeviceSynchronize to check for real, but do not forget to set the corresponding flag on your device.

How to get multithreading working properly using pthreads (not boost) in a C++ class

I have a project that I have summarized here with some pseudo code to illustrate the problem.
I do not have a compiler issue and my code compiles fine, whether it uses boost or pthreads. Remember this is pseudo code designed to illustrate the problem, not directly compilable code.
The problem I am having is that for a multithreaded function the memory usage and processing time are always greater than if the same work is done with serial code, e.g. a for/while loop.
Here is a simplified version of the problem I am facing:
class aproject {
public:
    typedef struct
    {
        char** somedata;
        double output, fitness;
    } entity;
    entity **entity_array;
    int whichthread, numthreads;
    pthread_mutex_t mutexdata;

    aproject(){
        numthreads = 100;
        entity_array = new entity*[numthreads];
        for(int i = 0; i < numthreads; i++){
            entity_array[i] = new entity;
            entity_array[i]->somedata = new char*[2];
            entity_array[i]->somedata[0] = new char[100];
            entity_array[i]->somedata[1] = new char[100];
        }
        /*.....more memory allocations for entity_array.......*/
        this->initdata();
        this->eval_thread();
    }

    void initdata(){
        /**put zeros and ones in entity_array**/
    }

    float somefunc(char *somedata){
        float output = countzero(); //some other function not listed
        return output;
    }

    void* thread_function()
    {
        pthread_mutex_lock(&mutexdata);
        int currentthread = this->whichthread;
        this->whichthread += 1;
        pthread_mutex_unlock(&mutexdata);
        entity *ent = this->entity_array[currentthread];
        double A = 0, B = 0, t4 = 0;
        A = somefunc(ent->somedata[0]);
        B = somefunc(ent->somedata[1]);
        t4 = anotherfunc(A, B);
        ent->output = t4;
        ent->fitness = sqrt(pow(t4, 2));
        return NULL;
    }

    static void* staticthreadproc(void* p){
        return reinterpret_cast<aproject*>(p)->thread_function();
    }

    void eval_thread(){
        //use multithreading to evaluate individuals in parallel
        int nthreads = this->numthreads;
        pthread_t threads[nthreads];
        //create threads
        pthread_mutex_init(&this->mutexdata, NULL);
        this->whichthread = 0;
        for(int i = 0; i < nthreads; i++){
            pthread_create(&threads[i], NULL, &aproject::staticthreadproc, this);
            //printf("creating thread, %d\n", i);
        }
        //join threads
        for(int i = 0; i < nthreads; i++){
            pthread_join(threads[i], NULL);
        }
    }
};
I am using pthreads here because they work better than boost on machines with less memory.
Each thread is started in eval_thread and terminated there as well. I am using a mutex to ensure every thread starts with the correct index into entity_array, as each thread only applies its work to the entity_array element indexed by the variable this->whichthread. This variable is the only thing that needs to be protected by the mutex, as it is updated for every thread and must not be changed by other threads. You can happily ignore everything apart from thread_function, eval_thread, and staticthreadproc, as they are the only relevant functions; assume that all the other functions apart from initdata are both processor and memory intensive.
So my question is: why is using multithreading in this way more costly in memory and speed than the traditional method of not using threads at all?
I MUST REITERATE THAT THE CODE IS PSEUDO CODE AND THE PROBLEM ISN'T WHETHER IT WILL COMPILE.
Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.
Each thread requires its own call stack, which consumes memory. Every local variable of your function (and of all other functions on the call stack) counts toward that memory.
When creating a new thread, space for its call stack is reserved. I don't know what the default value is for pthreads, but you might want to look into that. If you know you require less stack space than is reserved by default, you might be able to reduce memory consumption significantly by explicitly specifying the desired stack size when spawning the thread.
As for the performance part, it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations onto more threads than you have cores (I don't know if that is the case here). It might end up being slower, due to the additional overhead of context switches, an increased number of cache misses, etc. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache misses, and there are tools for Linux as well).
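For example, a minimal sketch of requesting an explicit stack size (the worker body and the 64 KiB figure are only placeholders; pick a size you have verified is enough for your deepest call chain):
#include <pthread.h>

void *worker(void *arg) {
    // thread body goes here
    return NULL;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024);   // request a 64 KiB stack instead of the default
    pthread_t tid;
    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}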
Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.
Another issue can be cache invalidation; when one thread writes its results in to its entity structure, it may invalidate nearby cached memory and force other threads to fetch data from higher-level caches or main memory.
You may have better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best - that means that you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Make sure that the entities each thread works on are located together in the array, so it does not invalidate those cached by other threads.
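As a rough illustration of that suggestion, here is a sketch with a handful of worker threads, each given a contiguous slice of the array (Entity, the per-item work, and the thread count are placeholders, not taken from the question):
#include <pthread.h>
#include <vector>

struct Entity { double output; };

struct Slice { Entity *data; int begin, end; };

void *process_slice(void *p) {
    Slice *s = static_cast<Slice *>(p);
    for (int i = s->begin; i < s->end; i++)
        s->data[i].output = i * 0.5;   // stand-in for the real per-entity work
    return NULL;
}

int main() {
    const int n = 100, nthreads = 4;   // e.g. one thread per core
    std::vector<Entity> entities(n);
    std::vector<Slice> slices(nthreads);
    std::vector<pthread_t> tids(nthreads);
    for (int t = 0; t < nthreads; t++) {
        slices[t].data = entities.data();
        slices[t].begin = t * n / nthreads;
        slices[t].end = (t + 1) * n / nthreads;
        pthread_create(&tids[t], NULL, process_slice, &slices[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);
    return 0;
}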

CUDA memory shared among all threads

I started my adventure with CUDA today. I'm trying to share an unsigned int among all the threads. All the threads modify this value. I copied this value to the device using cudaMemcpy, but at the end, when the calculations finished, I found that the value was equal to 0.
Maybe several threads are writing to this variable at the same time?
I'm not sure if I should use semaphores, or lock this variable when a thread starts writing, or what.
EDIT:
It's hard to say in more detail, because my question is really about how to solve this in general. I'm not writing any particular algorithm, only testing CUDA.
But if you wish... I created a vector which contains some values (unsigned int). I tried to do something like searching for values bigger than a given shared value; when a value from the vector is bigger, I add 1 to it and save that as the shared value.
It looks like this:
__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)
        *b = a[idx] + 1;
}
As I said, it's not useful code, only for testing, but I wonder how to do it...
"My question is in general how to use shared memory global for every threads."
To read you don't need anything special. What you did works, faster on Fermi devices because they have a cache, slower on the others.
If you are reading the value after other threads changed it you have no way to wait for all threads to finish their operations before reading the value you want so it might not be what you expect. The only way to synchronize a value in global memory between all running threads is to use different kernels. After you change a value you want to share between all threads the kernel finishes and you launch a new one that will work with the shared value.
To make every thread write to the same memory location you must use atomic operations but keep in mind you should keep atomic operations to a minimum as this effectively serializes the execution.
To know the available atomic functions read section B.11 of the CUDA C Programming Guide available here.
What you asked would be:
__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)
        //*b = a[idx]+1;
        atomicAdd(b, a[idx] + 1);
}
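Note that atomicAdd accumulates into *b. If the intent was instead for *b to end up holding the largest a[idx]+1, atomicMax may be closer to what was asked; a hedged sketch (the kernel name is made up):
__global__ void method_max(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        atomicMax(b, a[idx] + 1);   // the "bigger than" test becomes implicit: *b can only grow
}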
If the value is in shared memory it will only be local to the threads that run in a single thread block (i.e. on one multiprocessor), and NOT shared by every thread that runs for that kernel. You will definitely need to perform atomic operations (such as atomicAdd etc.) if you expect each thread to be writing to the variable simultaneously.
Be aware, though, that this will serialize all simultaneous thread requests for writing to the variable.
Ideally, though, you don't want to do this unless you can be sure all the threads are going to take about the same time. See Cuda thread tutorial.
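Because atomics serialize, a common mitigation (sketched here with illustrative names, assuming a block size of 256, and shown computing the maximum of a into *b) is to reduce within each block in shared memory first and then issue only one atomic per block:
__global__ void block_max(const unsigned int *a, unsigned int *b, unsigned int N) {
    __shared__ unsigned int smax[256];             // assumes blockDim.x == 256
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    smax[threadIdx.x] = (idx < N) ? a[idx] : 0;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smax[threadIdx.x] = max(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicMax(b, smax[0]);                     // one atomic per block instead of one per thread
}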

How to launch another thread from OpenCL code?

My algorithm consists of two steps:
Data generation. In this step I generate the data array in a loop, as the result of some function.
Data processing. For this step I have written an OpenCL kernel which processes the data array generated in the previous step.
Right now the first step runs on the CPU because it is hard to parallelize. I want to run it on the GPU because each generation step takes some time, and I want to run the second step on the already generated data immediately.
Can I run another OpenCL kernel from the currently running kernel, in a separate thread? Or will it run in the same thread as the caller kernel?
Some pseudocode to illustrate my point:
__kernel void second(__global int *data, int index) {
    // work on data[index]. This process takes a lot of time
}

__kernel void first(__global int *data, const int length) {
    for (int i = 0; i < length; i++) {
        // generate data and store it in data[i]
        // Will this call run in the same thread as the caller, or in a new thread?
        // If in the same thread, is there a way to launch it in a separate thread?
        second(data, i);
    }
}
No, OpenCL has no concept of threads in that sense, and a kernel execution cannot launch another kernel. All kernel execution is triggered by the CPU.
You should launch one kernel.
Then do a clFinish().
Then execute the next kernel.
There are more efficient ways, but I would only confuse you with events.
You just use the memory output of the first kernel as input for the second one. That way, you avoid the CPU->GPU copy process.
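A minimal host-side sketch of that sequence (the usual platform/context/queue/program setup and error checking are omitted; queue, first_kernel, second_kernel, data_buf, and length are assumed to already exist, and second is assumed to take its index from get_global_id(0) when launched this way):
// generation step: a single work-item runs the loop that fills data_buf
cl_int length_arg = length;
clSetKernelArg(first_kernel, 0, sizeof(cl_mem), &data_buf);
clSetKernelArg(first_kernel, 1, sizeof(cl_int), &length_arg);
size_t gws_first = 1;
clEnqueueNDRangeKernel(queue, first_kernel, 1, NULL, &gws_first, NULL, 0, NULL, NULL);

clFinish(queue);   // wait until generation has finished

// processing step: one work-item per generated element, reusing the same buffer
clSetKernelArg(second_kernel, 0, sizeof(cl_mem), &data_buf);
size_t gws_second = (size_t)length;
clEnqueueNDRangeKernel(queue, second_kernel, 1, NULL, &gws_second, NULL, 0, NULL, NULL);
clFinish(queue);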
I believe that the global work size might be considered as the number of threads that will be executed, in one way or another. Correct me if I'm wrong.