My algorithm consists of two steps:
Data generation. In this step I generate a data array in a loop, as the result of some function.
Data processing. For this step I have written an OpenCL kernel which processes the data array generated in the previous step.
Right now the first step runs on the CPU because it is hard to parallelize. I want to run it on the GPU, because each generation step takes some time, and I want to run the second step on already generated data immediately.
Can I run another OpenCL kernel from the currently running kernel in a separate thread? Or will it run in the same thread as the caller kernel?
Some pseudocode to illustrate my point:
__kernel void second(__global int *data, int index) {
    // work on data[index]. This process takes a lot of time.
}

__kernel void first(__global int *data, const int length) {
    for (int i = 0; i < length; i++) {
        // generate data and store it in data[i]
        // Will this kernel be launched in the same thread as the caller, or in a new thread?
        // If in the same thread, is there a way to launch it in a separate thread?
        second(data, i);
    }
}
No, OpenCL has no concept of threads in that sense, and a kernel cannot launch another kernel. All kernel execution is triggered by the host (CPU).
You should launch one kernel.
Then do a clFinish();
Then execute the next kernel.
There are more efficient ways (using events), but they would only complicate things here.
You just use the memory output of the first kernel as the input of the second one. That way, you avoid the CPU->GPU copy process.
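A minimal host-side sketch of that sequence (assuming the context, the command queue and both kernels have already been created and built; firstKernel, secondKernel, globalSize and length are placeholder names, and second would index the array with get_global_id(0) rather than taking an index argument):

cl_int err;
// One device buffer shared by both kernels; the data never leaves the GPU.
cl_mem data = clCreateBuffer(context, CL_MEM_READ_WRITE,
                             length * sizeof(cl_int), NULL, &err);

clSetKernelArg(firstKernel, 0, sizeof(cl_mem), &data);
clSetKernelArg(firstKernel, 1, sizeof(cl_int), &length);
clEnqueueNDRangeKernel(queue, firstKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clFinish(queue);   // wait for generation to complete

clSetKernelArg(secondKernel, 0, sizeof(cl_mem), &data);
clEnqueueNDRangeKernel(queue, secondKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clFinish(queue);   // wait for processing to complete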
I believe that the global work size might be considered as the number of threads that will be executed, in one way or another. Correct me if I'm wrong.
Related
I am using ALSA to read in an audio stream. I have set my period size to 960 frames which are received every 20ms. I read in the PCM values using: snd_pcm_readi(handle, buffer, period_size);
Every time I fill this buffer, I need to iterate through it and perform multiple checks on the received values. Iterating through it with a simple for loop takes too long, and I get a buffer overrun error on subsequent calls to snd_pcm_readi(). I have been told not to increase the ALSA buffer_size to prevent this from happening. A friend suggested I create a separate thread to iterate through this buffer and perform the checks. How would this work, given that I don't know exactly how long it will take for snd_pcm_readi() to fill the buffer? Knowing when to lock the buffer is a bit confusing to me.
A useful way to multithread an application for signal processing and computation is to use OpenMP (OMP). This avoids the developer having to use locking mechanisms themselves to synchronise multiple computation threads. Typically using locking is a bad thing in real time application programming.
In this example, a for loop is multithreaded in C:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int N = 960;
    float audio[960] = {0}; // the signal to process (placeholder data)
    #pragma omp parallel for
    for (int n = 0; n < N; n++) { // process each sample in parallel
        printf("n=%d thread num %d\n", n, omp_get_thread_num()); // remove this to speed up the code
        audio[n] = 0.5f * audio[n]; // process your signal here
    }
    return 0;
}
You can see a concrete example of this in the gtkIOStream FIR code base. The FIR filter's channels are processed independently, one per available thread.
To initialise the OMP subsystem, specify the number of threads you want to use; to make use of all available threads:
int MProcessors = omp_get_max_threads();
omp_set_num_threads(MProcessors);
If you would prefer to look at an approach which uses locking techniques, then you could go for a concurrency pattern such as that developed for the Nuclear Processing.
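For illustration only, here is a generic double-buffer sketch of such a locking approach (this is not the pattern referenced above): the capture thread swaps each filled ALSA period into a shared buffer and signals a worker, which takes the buffer under the lock and runs the checks outside it. fillFromAlsa and runChecks are hypothetical placeholders, and S16 samples are assumed.

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;
std::condition_variable cv;
std::vector<short> ready_buffer(960); // one ALSA period, S16 samples
bool data_ready = false;

void captureLoop() {
    std::vector<short> capture(960);
    for (;;) {
        // fillFromAlsa(capture); // e.g. snd_pcm_readi(handle, capture.data(), 960);
        {
            std::lock_guard<std::mutex> lock(m);
            ready_buffer.swap(capture); // hand the filled period over
            data_ready = true;
        }
        cv.notify_one(); // wake the worker
    }
}

void workerLoop() {
    std::vector<short> local(960);
    for (;;) {
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return data_ready; });
            local.swap(ready_buffer); // take the filled period
            data_ready = false;
        }
        // runChecks(local); // the checks run outside the lock
    }
}

// Usage: std::thread capture(captureLoop), worker(workerLoop);
// A real implementation would queue several periods so that a slow check
// cannot cause a period to be overwritten before it is taken.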
So far I have written programs where a kernel is called only once in the program
So I have a kernel
__global__ void someKernel(float *d_in) { // any parameters
    // some operation
}
and I basically do
int main()
{
    // create an array in device memory
    cudaMalloc(......);
    // move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    // call the kernel
    someKernel<<<nblocks, 512>>>(.......);
    // copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // Point to notice HERE
}
It works fine. However, this time I want to call the kernel not just once but many times.
Something like
int main()
{
    // create an array in device memory
    cudaMalloc(......);
    // move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    // call the kernel
    someKernel<<<nblocks, 512>>>(.......);
    // copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);

    // From here:
    // some unrelated calculations
    dothis();
    dothat();

    // Then the kernel again, repeatedly
    for (k : some_ks)
    {
        // do some pre-calculations
        // call the kernel
        someKernel<<<nblocks, 512>>>(.......);
        // some post-calculations
    }
}
My question is: should I use some kind of synchronization between the first kernel call and the kernel calls in the for loop (and between iterations)?
Perhaps cudaDeviceSynchronize, or something else? Or is it not necessary?
Additional synchronization would not be necessary in this case, for at least two reasons:
1. cudaMemcpy is already a synchronizing call. It blocks the CPU thread and waits until all previous CUDA activity issued to that device is complete before it allows the data transfer to begin. Once the data transfer is complete, the CPU thread is allowed to proceed.
2. CUDA activity issued to a single device will not overlap in any way unless you are using CUDA streams. You are not using streams, so even asynchronous work issued to the device will execute in issue order: items A and B issued to the device in that order will not overlap with each other; item A will complete before item B is allowed to begin. This is a basic point of CUDA stream semantics.
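As a small illustration of point 2 (with a hypothetical incKernel, not the kernel from the question): several launches into the default stream run back to back, and the final cudaMemcpy only starts once they have all finished, with no explicit synchronization in the host code.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each call increments every element once.
__global__ void incKernel(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    // Several launches into the default stream: they execute in issue order,
    // so no cudaDeviceSynchronize() is needed between them.
    for (int k = 0; k < 5; ++k)
        incKernel<<<(n + 511) / 512, 512>>>(d_data, n);

    // cudaMemcpy waits for all previously issued work, then copies.
    int h_first;
    cudaMemcpy(&h_first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    printf("d_data[0] = %d (expected 5)\n", h_first);

    cudaFree(d_data);
    return 0;
}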
I've read that dynamic parallelism is supported in newer versions of CUDA, and that I can call Thrust functions like thrust::exclusive_scan inside a kernel function with the thrust::device parameter.
#include <thrust/scan.h>
#include <thrust/execution_policy.h>

__global__ void kernel(int *inarray, int n, int *result) {
    extern __shared__ int s[];
    int t = threadIdx.x;
    s[t] = inarray[t];
    __syncthreads();
    thrust::exclusive_scan(thrust::device, s, s + n, result);
    __syncthreads();
}

int main() {
    // prep work
    kernel<<<1, n, n * sizeof(int)>>>(inarray, n, result);
}
The thing I'm confused about is:
When calling a Thrust function inside a kernel, does each thread call the function once, and do they all perform dynamic parallelism on the data?
If they do, I only need one thread to call Thrust, so I can just add an if on threadIdx; if not, how do the threads in a block communicate with each other that the call to Thrust has already been made and they should just ignore it? (This seems a little imaginary, since there wouldn't be a systematic way to ensure that from user code.) To summarize: what exactly happens when I call a Thrust function with the thrust::device parameter inside a kernel?
Every thread in your kernel that executes the thrust algorithm will execute a separate copy of your algorithm. The threads in your kernel do not cooperate on a single algorithm call.
If you have met all the requirements (HW/SW and compilation settings) for a CUDA dynamic parallelism (CDP) call, then each thread that encounters the thrust algorithm call will launch a CDP child kernel to perform the thrust algorithm (in that case, the threads in the CDP child kernel do cooperate). If not, each thread that encounters the thrust algorithm call will perform it as if you had specified thrust::seq instead of thrust::device.
If you prefer to avoid the CDP activity in an otherwise CDP-capable environment, you can specify thrust::seq instead.
If you intend, for example, that only a single copy of your thrust algorithm be executed, it will be necessary in your kernel code to ensure that only one thread calls it, for example:
if (!threadIdx.x) thrust::exclusive_scan(...
or similar.
Questions around synchronization before/after the call are no different from ordinary CUDA code. If you need all threads in the block to wait for the thrust algorithm to complete, use e.g. __syncthreads(), (and cudaDeviceSynchronize() in the CDP case).
The information here may possibly be of interest as well.
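For instance, if the intent is a single scan per block, a minimal sketch (hypothetical names; thrust::seq is used here so no CDP launch is involved, as described above) would be:

#include <thrust/scan.h>
#include <thrust/execution_policy.h>

// Only one thread performs the scan; the rest of the block waits for it.
__global__ void scan_once(const int *in, int n, int *out) {
    if (threadIdx.x == 0)
        thrust::exclusive_scan(thrust::seq, in, in + n, out);
    __syncthreads();   // all threads wait for the scan to finish
    // every thread in the block may now read out[0..n-1]
}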
I am using the OpenCL 1.2 C++ wrapper for my project. I want to know what the correct way of calling my kernel is. In my case, I have 2 devices and the data should be sent to them simultaneously.
I am dividing my data into two chunks, and both devices should be able to perform computations on them separately. They have no interconnection and they don't need to know what is happening on the other device.
When the data has been sent to both devices, I want to wait for the kernels to finish before my program goes further, because I will be using the results returned from both kernels, so I don't want to start reading the data back before the kernels have returned.
I have 2 methods. Which one is programmatically correct in my case?
Method 1:
for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i],
             arguments etc...);
    queue[i].flush();
}

// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Method 2:
for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i],
             arguments etc...);
}

for (int i = 0; i < numberOfDevices; i++) {
    queue[i].flush();
}

// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Or is neither of them correct, and is there a better way to wait for my kernels to return?
Assuming each device computes in its own memory:
I would go for a multi-threaded (one host thread per device) version of your Method 1, because OpenCL doesn't force vendors to enqueue asynchronously. NVIDIA, for example, does synchronous enqueuing for some drivers and hardware, while AMD enqueues asynchronously.
When each device is driven by a separate thread, they should enqueue write+compute together before synchronizing for reading the partial results (in a second threaded loop).
Having multiple threads is also advantageous for spin-wait type synchronization (clFinish), because multiple spin-wait loops run in parallel. This can save on the order of a millisecond.
Flush helps some vendors, such as AMD, start executing the enqueued commands early.
To have correct input and correct output for all devices, only two finish commands are enough: one after write+compute, then one after read (of the results). That way each device gets the data for the same time step and produces results at the same time step. Write and compute don't need a finish between them if the queue type is in-order, because it executes them one by one. This also means the read operations don't need to be blocking.
Unnecessary finish commands always kill performance.
Note: I have already written a load balancer using all of this, and it performs better when using event-based synchronization instead of finish. Finish is easier, but it has bigger synchronization overhead than an event-based approach.
Also, a single queue doesn't always push a GPU to its limits. Using at least 4 queues per device ensures latency hiding of write and compute on my AMD system; sometimes even 16 queues help a bit more. For I/O-bottlenecked situations you may need even more.
Example:

Thread 1: write, compute, synchronize with the other thread
Thread 2: write, compute, synchronize with the other thread
Thread 1: read, synchronize with the other thread
Thread 2: read, synchronize with the other thread
Naive synchronization kills performance, because drivers don't know your intention and leave it as it is, so you should eliminate unnecessary finish commands and convert blocking writes to non-blocking ones where you can.
Zero synchronization is also wrong, because OpenCL doesn't force vendors to start computing after just a few enqueues; the queued work may grow indefinitely, to gigabytes of memory, within minutes or even seconds.
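A rough sketch of the threaded Method 1 (assuming one cl::CommandQueue, one cl::Kernel with its arguments already set, and one cl::Buffer per device; chunk, chunkBytes and globalSize are placeholders, and plain enqueueNDRangeKernel is used instead of the functor syntax):

#include <CL/cl.hpp>   // OpenCL 1.2 C++ wrapper
#include <thread>
#include <vector>

void runOnDevice(cl::CommandQueue &queue, cl::Kernel &kernel, cl::Buffer &buffer,
                 const void *chunk, size_t chunkBytes, size_t globalSize) {
    queue.enqueueWriteBuffer(buffer, CL_FALSE, 0, chunkBytes, chunk); // non-blocking write
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(globalSize), cl::NullRange);
    queue.flush();   // make sure the commands are issued to the device
    queue.finish();  // wait for this device's write+compute to complete
}

// One host thread per device; join them before reading the results back:
// std::vector<std::thread> workers;
// for (int i = 0; i < numberOfDevices; ++i)
//     workers.emplace_back(runOnDevice, std::ref(queue[i]), std::ref(kernels[i]),
//                          std::ref(buffers[i]), chunks[i], chunkBytes, globalSize);
// for (auto &t : workers) t.join();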
You should use Method 1. clFlush is the only way of guaranteeing that commands are issued to the device (and not just buffered somewhere before sending).
I'm writing a CUDA-based program that needs to periodically transfer a set of items from the GPU to host memory. In order to keep the process asynchronous, I was hoping to use CUDA's UMA to have a memory buffer and a flag in host memory (so both the GPU and the CPU can access it). The GPU would make sure the flag is clear, add its items to the buffer, and set the flag. The CPU waits for the flag to be set, copies things out of the buffer, and clears the flag. As far as I can see, this doesn't produce any race condition, because it forces the GPU and CPU to take turns, always reading and writing the flag opposite each other.
So far I haven't been able to get this to work because there does seem to be some sort of race condition. I came up with a simpler example that has a similar issue:
#include <stdio.h>

__global__
void uva_counting_test(int n, int *h_i);

int main() {
    int *h_i;
    int n;

    cudaMallocHost(&h_i, sizeof(int));
    *h_i = 0;
    n = 2;

    uva_counting_test<<<1, 1>>>(n, h_i);

    //even numbers
    for(int i = 1; i <= n; ++i) {
        //wait for a change to odd from gpu
        while(*h_i == (2*(i - 1)));

        printf("host h_i: %d\n", *h_i);
        *h_i = 2*i;
    }

    return 0;
}

__global__
void uva_counting_test(int n, int *h_i) {
    //odd numbers
    for(int i = 0; i < n; ++i) {
        //wait for a change to even from host
        while(*h_i == (2*(i - 1) + 1));

        *h_i = 2*i + 1;
    }
}
For me, this case always hangs after the first print statement from the CPU (host h_i: 1). The really unusual thing (which may be a clue) is that I can get it to work in cuda-gdb: if I run it in cuda-gdb, it hangs as before, but if I press Ctrl+C it brings me to the while() loop line in the kernel. From there, surprisingly, I can tell it to continue and it will finish. For n > 2, it freezes on the while() loop in the kernel again after each kernel iteration, but I can keep pushing it forward with Ctrl+C and continue.
If there's a better way to accomplish what I'm trying to do, that would also be helpful.
You are describing a producer-consumer model, where the GPU is producing some data and from time to time the CPU will consume that data.
The simplest way to implement this is to have the CPU be the master. The CPU launches a kernel on the GPU; when it is ready to consume data (i.e. the while loop in your example) it synchronises with the GPU, copies the data back from the GPU, launches the kernel again to generate more data, and does whatever it needs to do with the data it copied. This allows you to have the GPU filling a fixed-size buffer while the CPU is processing the previous batch (since there are two copies, one on the GPU and one on the CPU).
That can be improved upon by double-buffering the data, meaning that you can keep the GPU busy producing data 100% of the time by ping-ponging between buffers as you copy the other to the CPU. That assumes the copy-back is faster than the production, but if not then you will saturate the copy bandwidth which is also good.
Neither of those are what you actually described. What you asked for is to have the GPU master the data. I'd urge caution on that since you will need to manage your buffer size carefully and you will need to think carefully about the timings and communication issues. It's certainly possible to do something like that but before you explore that direction you should read up about memory fences, atomic operations, and volatile.
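A minimal sketch of that CPU-master pattern in its simplest (non-double-buffered) form, with hypothetical produceItems and consumeOnHost placeholders:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer: each launch fills the buffer with one batch of items.
__global__ void produceItems(int *buf, int itemsPerBatch, int batch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < itemsPerBatch) buf[i] = batch * itemsPerBatch + i;
}

// Hypothetical consumer running on the host.
void consumeOnHost(const int *items, int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += items[i];
    printf("batch sum = %lld\n", sum);
}

int main() {
    const int itemsPerBatch = 1 << 16, numBatches = 4;
    int *d_buf, *h_buf = new int[itemsPerBatch];
    cudaMalloc(&d_buf, itemsPerBatch * sizeof(int));

    for (int b = 0; b < numBatches; ++b) {
        produceItems<<<(itemsPerBatch + 255) / 256, 256>>>(d_buf, itemsPerBatch, b);
        // cudaMemcpy waits for the kernel, then copies the batch back.
        cudaMemcpy(h_buf, d_buf, itemsPerBatch * sizeof(int), cudaMemcpyDeviceToHost);
        consumeOnHost(h_buf, itemsPerBatch); // the CPU consumes the batch here
    }

    cudaFree(d_buf);
    delete[] h_buf;
    return 0;
}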
I'd try to add
__threadfence_system();
after
*h_i = 2*i + 1;
See here for details. Without it, it is entirely possible that the modification stays in the GPU cache forever. However, you had better also listen to the other answers: to improve this for multiple threads/blocks you have to deal with other "problems" to get a similar scheme to work reliably.
As Tom suggested (+1), it is better to use double buffering. Streams help a lot with such a scheme, as you can find depicted here.
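For concreteness, a sketch of the question's kernel with that fence added (the flag pointer is also marked volatile here, in line with the first answer's pointer towards volatile, so the spin-wait actually re-reads memory); this is an illustration, not a guaranteed-correct protocol:

__global__ void uva_counting_test(int n, volatile int *h_i) {
    // odd numbers
    for (int i = 0; i < n; ++i) {
        // wait for a change to even from the host
        while (*h_i == (2 * (i - 1) + 1));

        *h_i = 2 * i + 1;
        __threadfence_system();   // push the write out so the host can see it
    }
}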