Do OpenCL variables or arrays in a kernel cost memory? - c++

I am trying to run the following OpenCL code. In the kernel function, I define an array int arr[1000] = {0};
kernel void test()
{
int arr[1000] = {0};
}
Then I create N threads to run the kernel.
cl::CommandQueue cmdQueue;
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange); // kernel here is the one running test()
My question is: since we know that OpenCL runs the threads in parallel, does that mean the peak memory usage will be N * 1000 * sizeof(int)?

This is not the way to OpenCL (yes, that's what I meant :).
The kernel function operates on kernel operands passed in from the host (CPU) - so you'd allocate your array on the host using clCreateBuffer and set the argument using clSetKernelArg. Your kernel does not declare/allocate the device memory, but simply receives it as a __global argument. Now when you run the kernel using clEnqueueNDRangeKernel with a global size of 1000, the OpenCL implementation will run one work-item for each of those 1000 ints.
If, on the other hand, you meant to allocate 1000 ints per work-item (device thread), your calculation is right (yes, they cost memory, from each work-item's private pool), but it probably won't work. An array declared inside the kernel like that lives in each work-item's private memory (see here on how to check its size for your device), which is severely limited.
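As a rough sketch of the host-side pattern described above, using the C++ wrapper the question already uses (no error checking; the trivial kernel source and the fixed size of 1000 are only for illustration):
#include <CL/cl.hpp>
#include <vector>

int main()
{
    // Minimal setup: first platform, first GPU device
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Context context(devices);
    cl::CommandQueue cmdQueue(context, devices[0]);

    // The kernel receives the array as a __global argument instead of declaring it
    const char* src = "kernel void test(global int* arr) { arr[get_global_id(0)] = 0; }";
    cl::Program program(context, src, true /* build */);
    cl::Kernel kernel(program, "test");

    // One buffer of 1000 ints, allocated once and shared by all work-items
    cl::Buffer arrBuf(context, CL_MEM_READ_WRITE, 1000 * sizeof(int));
    kernel.setArg(0, arrBuf);

    // One work-item per element
    cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1000), cl::NullRange);
    cmdQueue.finish();
    return 0;
}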

Related

Is it necessary to use synchronization between two calls to CUDA kernels?

So far I have written programs where a kernel is called only once in the program.
So I have a kernel
__global__ void someKernel(float * d_in ){ //Any parameters
//some operation
}
and I basically do
main()
{
//create an array in device memory
cudaMalloc(......);
//move host data to that array
cudaMemcpy(......,cudaMemcpyHostToDevice);
//call the kernel
someKernel<<<nblocks,512>>>(.......);
//copy results to host memory
cudaMemcpy(......,cudaMemcpyDeviceToHost);
// Point to notice HERE
}
It works fine. However, this time I want to call the kernel not just once but many times.
Something like
main()
{
//create an array in device memory
cudaMalloc(......);
//move host data to that array
cudaMemcpy(......,cudaMemcpyHostToDevice);
//call the kernel
someKernel<<<nblocks,512>>>(.......);
//copy results to host memory
cudaMemcpy(......,cudaMemcpyDeviceToHost);
// From here
//Some unrelated calculations here
dothis();
dothat();
//Then again the kernel repeatedly
for(k: some_ks)
{
// Do some pre-calculations
//call the kernel
someKernel<<<nblocks,512>>>(.......);
// some post calculations
}
}
My question is: should I use some kind of synchronization between calling the kernel the first time and calling it in the for loop (and in each iteration)?
Perhaps cudaDeviceSynchronize or something else, or is it not necessary?
Additional synchronization would not be necessary in this case, for at least two reasons:
cudaMemcpy is a synchronizing call already. It blocks the CPU thread and waits until all previous CUDA activity issued to that device is complete, before it allows the data transfer to begin. Once the data transfer is complete, the CPU thread is allowed to proceed.
CUDA activity issued to a single device will not overlap in any way unless you use (non-default) CUDA streams. You are not using streams, so even asynchronous work issued to the device will execute in issue order: items A and B issued to the device in that order will not overlap with each other, and item A will complete before item B is allowed to begin. This is a basic point of CUDA stream semantics.
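A small self-contained illustration of that ordering guarantee (the kernel body, sizes, and data here are placeholders, not the asker's actual code):
#include <cstdio>

__global__ void someKernel(float *d_in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_in[i] += 1.0f; // some operation
}

int main()
{
    const int n = 512, nblocks = 1;
    float h[n] = {0};
    float *d_in;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Both launches go to the default stream, so the second cannot start
    // until the first has finished; no cudaDeviceSynchronize() is needed.
    someKernel<<<nblocks, 512>>>(d_in);
    someKernel<<<nblocks, 512>>>(d_in);

    // cudaMemcpy blocks the CPU thread until all the work above is complete.
    cudaMemcpy(h, d_in, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]); // prints 2.000000
    cudaFree(d_in);
    return 0;
}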

How to avoid constant memory copying in OpenCL

I wrote a C++ application which simulates simple heat flow. It uses OpenCL for the computation.
The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n). It returns a new array with the temperatures after each cycle:
pseudocode:
int t_id = get_global_id(0);
if(t_id < n * n)
{
m_new[t_id / n][t_id % n] = average of its own and its neighbors' (top, bottom, left, right) temperatures
}
As you can see, every thread computes a single cell of the matrix. When the host application needs to perform X computation cycles, it looks like this:
For 1 ... X
Copy memory to OpenCL device
Call kernel
Copy memory back
I would like to rewrite the kernel code to perform all X cycles without constantly copying memory to/from the OpenCL device.
Copy memory to OpenCL device
Call kernel X times OR call kernel one time and make it compute X cycles.
Copy memory back
I know that each thread in the kernel should wait until all other threads have done their job, and after that m[][] and m_new[][] should be swapped. I have no idea how to implement either of those two things.
Or maybe there is another way to do this optimally?
Copy memory to OpenCL device
Call kernel X times
Copy memory back
This works. Make sure the kernel call is non-blocking (so 1-2 ms per cycle is saved) and that the buffers aren't created with host-accessible properties such as CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR.
If calling the kernel X times doesn't give satisfactory performance, you can try using a single workgroup (such as only 256 threads) that loops X times, with a barrier() at the end of each cycle so all 256 threads synchronize before starting the next cycle. This way you can compute M different heat-flow problems at the same time, where M is the number of compute units (or workgroups); if this is a server, it can serve that many computations.
Global synchronization is not possible because by the time the last threads are launched, the first threads are already gone. The device runs (number of compute units) * (number of threads per workgroup) * (number of wavefronts per workgroup) threads concurrently. For example, an R7 240 GPU with 5 compute units and a local range of 256 can run maybe 5*256*20 = 25k threads at a time.
Then, for further performance, you can apply local-memory optimizations.
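For the "call kernel X times" variant, a rough host-side sketch of the buffer ping-pong (this assumes context, queue, program, n, X and host_temps already exist, and that the kernel signature is kernel void heatStep(global const float* m, global float* m_new, int n); all of those names are made up):
cl::Buffer bufA(context, CL_MEM_READ_WRITE, n * n * sizeof(float));
cl::Buffer bufB(context, CL_MEM_READ_WRITE, n * n * sizeof(float));
// Copy to the device once, before the loop
queue.enqueueWriteBuffer(bufA, CL_TRUE, 0, n * n * sizeof(float), host_temps);

cl::Kernel heatStep(program, "heatStep");
heatStep.setArg(2, n);
for (int cycle = 0; cycle < X; ++cycle) {
    // Swap the roles of the two buffers each cycle instead of copying between them
    heatStep.setArg(0, (cycle % 2 == 0) ? bufA : bufB);
    heatStep.setArg(1, (cycle % 2 == 0) ? bufB : bufA);
    // Non-blocking enqueue; the in-order queue keeps the cycles in sequence
    queue.enqueueNDRangeKernel(heatStep, cl::NullRange, cl::NDRange(n * n), cl::NullRange);
}
// Copy back once, from whichever buffer holds the final result
queue.enqueueReadBuffer((X % 2 == 0) ? bufA : bufB, CL_TRUE, 0, n * n * sizeof(float), host_temps);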

Allocate CUDA device memory for a point cloud with increasing dimension (number of points)

I'm writing a program in which I need to:
make a test on each pixel of an image
if the test result is TRUE, I have to add a point to a point cloud
if the test result is FALSE, do nothing
I've already written working CPU-side code in C++.
Now I need to speed it up using CUDA. My idea was to have some blocks/threads (one thread per pixel, I guess) execute the test in parallel and, if the test result is TRUE, have the thread add a point to the cloud.
Here comes my trouble: how can I allocate space in device memory for a point cloud (using cudaMalloc or similar) if I don't know a priori the number of points that I will insert into the cloud?
Do I have to allocate a fixed amount of memory and then increase it every time the point cloud reaches its size limit? Or is there a way to "dynamically" allocate the memory?
When you allocate memory on the device, you may do so with two API calls: one is malloc, as described by Taro, but it is limited by an internal driver limit (8 MB by default), which can be increased by calling cudaDeviceSetLimit with the parameter cudaLimitMallocHeapSize.
Alternatively, you may use cudaMalloc within a kernel, as it is both a host and device API method.
In both cases, Taro's observation stands: you will allocate a new, different buffer, just as you would on the CPU. Hence, using a single buffer may still require copying data around. Note that cudaMemcpy is not a device API method, so you may need to write your own copy.
To my knowledge, there is no such thing as realloc in the CUDA API.
Back to your original issue, you might want to implement your algorithm in three phases: the first phase counts the number of samples you need, the second phase allocates the data array, and the third phase fills the data array. To implement this, you can use atomic functions to increment an int that counts the number of samples.
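A self-contained sketch of what the first (counting) phase could look like; passesTest() and all of the names here are placeholders, not your actual test:
#include <cstdio>

// Placeholder for the real per-pixel test
__device__ bool passesTest(unsigned char pixel) { return pixel > 128; }

__global__ void countPoints(const unsigned char *image, int numPixels, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels && passesTest(image[i]))
        atomicAdd(counter, 1); // phase 1: just count how many points will be needed
}

int main()
{
    const int numPixels = 1024;
    unsigned char h_img[numPixels];
    for (int i = 0; i < numPixels; ++i) h_img[i] = (unsigned char)(i % 256);

    unsigned char *d_img; int *d_count; int h_count = 0;
    cudaMalloc((void **)&d_img, numPixels);
    cudaMalloc((void **)&d_count, sizeof(int));
    cudaMemcpy(d_img, h_img, numPixels, cudaMemcpyHostToDevice);
    cudaMemcpy(d_count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

    countPoints<<<(numPixels + 255) / 256, 256>>>(d_img, numPixels, d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);

    // Phase 2 would now cudaMalloc exactly h_count points;
    // phase 3 would run a second kernel that fills them in.
    printf("points needed: %d\n", h_count);
    cudaFree(d_img); cudaFree(d_count);
    return 0;
}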
I would have liked to post this as a comment, as it only partially answers the question, but it is too long for that.
Yes, you can dynamically allocate memory from the kernels.
You can call malloc() and free() within your kernels to dynamically allocate and free memory during computation, as explained in section B-16 of the CUDA 7.5 Programming Guide:
__global__ void mallocTest()
{
size_t size = 123;
char* ptr = (char*)malloc(size);
memset(ptr, 0, size);
printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
free(ptr);
}
int main()
{
// Set a heap size of 128 megabytes. Note that this must
// be done before any kernel is launched.
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
mallocTest<<<1, 5>>>();
cudaDeviceSynchronize();
return 0;
}
(You will need compute capability 2.x or higher.)
But by doing this you allocate a new, different buffer in memory; you don't make your previously host-allocated buffer "grow" like a CPU dynamic container (vector, list, etc.).
I think you should define a constant for the maximum size of your array, allocate that maximum size, and have your kernel increment the "really used size" within this maximum buffer.
If you do so, don't forget to make this increment atomic/synchronized so that each increment from each concurrent thread is counted.
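A kernel-level sketch of that approach; passesTest() and makePoint() below are placeholders for your own pixel test and point construction, and maxPoints is the preallocated maximum:
// Placeholders standing in for the real test and point construction
__device__ bool passesTest(unsigned char pixel) { return pixel > 128; }
__device__ float3 makePoint(int i) { return make_float3((float)i, 0.0f, 0.0f); }

__global__ void collectPoints(const unsigned char *image, int numPixels,
                              float3 *cloud, int maxPoints, int *usedSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPixels || !passesTest(image[i])) return;

    int slot = atomicAdd(usedSize, 1); // atomically reserve one slot per accepted pixel
    if (slot < maxPoints)              // never write past the preallocated maximum
        cloud[slot] = makePoint(i);
}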

Access to Shared Memory in CUDA

I'm passing 3 arrays, with size N=224, to a kernel. The kernel itself calls another function foo(threadIdx.x) and foo calls another function bar(i) where i goes from 1 to 224. The second function needs to access the arrays passed to the kernel, but the code I have now tells me that the argument i is undefined.
I tried to save a copy of the arrays into shared memory but it didn't work:
__global__ void dummy(double *pos_x_d, double *pos_y_d, double *hist_d){
int i = threadIdx.x;
hist_d[i]=pos_x_d[i]+pos_y_d[i];
__syncthreads();
foo(i);
__syncthreads();
}
The host code looks like:
cudaMalloc((void **) &pos_x_d,(N*sizeof(double)));
cudaMalloc((void **) &pos_y_d,(N*sizeof(double)));
cudaMalloc((void **) &hist_d,(N*sizeof(double)));
//Copy data to GPU
cudaMemcpy((void *)pos_x_d, (void*)pos_x_h,N*sizeof(double),cudaMemcpyHostToDevice);
cudaMemcpy((void *)pos_y_d, (void*)pos_y_h,N*sizeof(double),cudaMemcpyHostToDevice);
//Launch Kernel
dummy<<<1,224>>>(pos_x_d,pos_y_d,hist_d);
Is it possible to launch two kernels, the first to load data into shared memory and the second to do the calculations? I also need to loop over the second kernel, which is why I wanted to put the data in shared memory in the first place. The error comes from lines 89 and 90, which means it has to do with the shared memory. The complete code is here
Is it possible to launch two kernels, the first to load data into shared memory and the second to do the calculations?
No, it's not possible. The lifetime of shared memory is the threadblock associated with that shared memory. A threadblock cannot reliably use the values stored by a different threadblock (whether from the same or different kernel launch) in shared memory.
The only way to save data from one kernel launch to the next is via global memory (or host memory).
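A minimal sketch of that, loosely based on the question's arrays; the kernel names and the doubling in the second kernel are made up just to show the data surviving in global memory between launches:
#include <cstdio>

__global__ void stage1(double *hist, const double *x, const double *y)
{
    int i = threadIdx.x;
    hist[i] = x[i] + y[i]; // first kernel writes its results to global memory
}

__global__ void stage2(double *hist)
{
    int i = threadIdx.x;
    hist[i] *= 2.0; // second kernel picks them up from global memory
}

int main()
{
    const int N = 224;
    double hx[N], hy[N], hh[N];
    for (int i = 0; i < N; ++i) { hx[i] = i; hy[i] = 2 * i; }

    double *dx, *dy, *dh;
    cudaMalloc((void **)&dx, N * sizeof(double));
    cudaMalloc((void **)&dy, N * sizeof(double));
    cudaMalloc((void **)&dh, N * sizeof(double));
    cudaMemcpy(dx, hx, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, N * sizeof(double), cudaMemcpyHostToDevice);

    stage1<<<1, N>>>(dh, dx, dy); // dh persists between launches,
    stage2<<<1, N>>>(dh);         // so the second kernel can keep working on it
    cudaMemcpy(hh, dh, N * sizeof(double), cudaMemcpyDeviceToHost);
    printf("hh[1] = %f\n", hh[1]); // prints 6.000000
    cudaFree(dx); cudaFree(dy); cudaFree(dh);
    return 0;
}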

OpenCL AES Parallelization

I am trying to write some code that does AES decryption for an SSL server. To speed it up, I am trying to combine multiple packets together to be decrypted on the GPU at one time.
I loop over each packet, submitting each kernel to the GPU followed by a read that uses the kernel's event for its wait. I then collect the events for all of the reads and wait on them all at the same time, but it seems to just run one block at a time and then do the next block. This is not what I would expect. I would have expected that if I queue all of the kernels, the driver would try to do as much work as possible in parallel.
Am I missing something? Do I have to specify the global work size to be the total number of blocks across all packets, and the kernel's local size to be the number of blocks in each packet?
This is my code for my OpenCL kernel.
__kernel void decryptCBC( __global const uchar *rkey, const uint rounds,
__global const uchar* prev, __global const uchar *data,
__global uchar *result, const uint blocks ) {
const size_t id = get_global_id( 0 );
if( id >= blocks ) return;
const size_t startPos = BlockSize * id;
// Create Block
uchar block[BlockSize];
for( uint i = 0; i < BlockSize; i++) block[i] = data[startPos+i];
// Calculate Result
AddRoundKey( rkey, block, rounds );
for( uint j = 1; j < rounds; ++j ){
const uint round = rounds - j;
InverseShiftRows( block );
InverseSubBytes( block );
AddRoundKey( rkey, block, round );
InverseMixColumns( block );
}
InverseSubBytes( block );
InverseShiftRows( block );
AddRoundKey( rkey, block, 0 );
// Store Result
for( uint i = 0; i < BlockSize; i++ ) {
result[startPos+i] = block[i] ^ prev[startPos+i];
}
}
With this kernel, I can beat an 8-core CPU with 125 blocks of data in a single packet. To speed up multiple packets, I attempted to combine all of the data elements together. This involved combining the input data into a single vector; complications then came from the need for each kernel to know where to look within the key, leading to two extra arrays containing the number of rounds and the offset of the rounds. This turned out to be even slower than executing a separate kernel for each packet.
Consider your kernel as a function doing CBC work. As you've found, its chained nature means the CBC task itself is fundamentally serialized. In addition, a GPU prefers to run 16 threads with identical workloads. That's essentially the size of a single task within a multiprocessor core, of which you tend to have dozens; but the management system can only feed them a few of these tasks overall, and the memory system can rarely keep up with them. In addition, loops are one of the worst uses of the kernel, because GPUs are not designed to do much control flow.
So, looking at AES, it operates on 16 byte blocks, but only in bytewise operations. This will be your first dimension - every block should be worked over by 16 threads (probably the local work size in opencl parlance). Make sure to transfer the block to local memory, where all threads can run in lockstep doing random accesses with very low latency. Unroll everything within an AES block operation, using get_local_id(0) to know which byte each thread operates on. Synchronize with barrier(CLK_LOCAL_MEM_FENCE) in case a workgroup runs on a processor that could run out of lockstep. The key can probably go into constant memory, as this can be cached. The block chaining might be an appropriate level to have a loop, if only to avoid reloading the previous block ciphertext from global memory. Also asynchronous storing of completed ciphertext using async_work_group_copy() may help. It's possible you can make a thread do more work by using vectors, but that probably won't help because of steps like shiftRows.
Basically, if any thread within a group of 16 threads (the exact number may vary with architecture) takes a different control-flow path, your GPU stalls. And if there aren't enough such groups to fill the pipelines and multiprocessors, your GPU sits idle. Until you've very carefully optimized the memory accesses, it won't come close to CPU speeds, and even after that you'll need dozens of packets to process at once to avoid handing the GPU workgroups that are too small. The issue then is that although the GPU can run thousands of threads, its control structure only handles a few workgroups at any time.
One other thing to beware of: when you're using barriers in a workgroup, every thread in the workgroup must execute the same barrier calls. That means that even if you have extra threads running idle (for instance, those decrypting a shorter packet within a combined workgroup), they must keep going through the loop even if they make no memory accesses.
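A structural skeleton of that layout (not a working AES implementation; the per-byte round work is reduced to a placeholder XOR so only the 16-thread, local-memory, barrier pattern is shown):
// Enqueue with a local work size of 16: one work-item per byte of an AES block
kernel void decryptBlock16(global const uchar *data, global uchar *result,
                           constant uchar *rkey, const uint rounds)
{
    const size_t blk  = get_group_id(0); // which 16-byte block this workgroup owns
    const size_t lane = get_local_id(0); // which byte of it this work-item owns

    local uchar state[16];
    state[lane] = data[blk * 16 + lane]; // stage the block in fast local memory
    barrier(CLK_LOCAL_MEM_FENCE);

    for (uint r = 0; r < rounds; ++r) {
        // Placeholder for the unrolled per-byte round work
        // (InverseShiftRows / InverseSubBytes / AddRoundKey / InverseMixColumns)
        state[lane] ^= rkey[r * 16 + lane];
        barrier(CLK_LOCAL_MEM_FENCE); // keep all 16 work-items in step between rounds
    }

    result[blk * 16 + lane] = state[lane];
}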
It's not entirely clear from your description, but I think there's some conceptual confusion.
Don't loop over each packet and start a new kernel launch for each one. You don't need to tell OpenCL to start a bunch of kernels. Instead, upload as many packets as you can to the GPU, then run the kernel just once. The global work size you specify when you enqueue it is how many instances of the kernel (work-items) the GPU will try to run simultaneously.
You will need to program your kernel so that each instance looks at a different location in the data you uploaded to find its packet. For example, if you were going to add two arrays into a third array, your kernel would look like this:
__kernel void vectorAdd(__global const int* a,
__global const int* b,
__global int* c) {
int idx = get_global_id(0);
c[idx] = a[idx] + b[idx];
}
The important part is that each kernel instance knows where to index into the array by using its global id. You'll want to do something similar.