I am trying to write some code that does AES decryption for an SSL server. To speed it up I am trying to combine multiple packets together to be decrypted on the GPU at one time.
If I loop over each packet, submitting its kernel to the GPU followed by a read that waits on that kernel's event, then collect the events from all of the reads and wait on them together, it still seems to process one packet's work at a time, finishing each before starting the next. This is not what I would expect: if I queue all of the kernels, I would hope the driver would do as much of the work as possible in parallel.
Am I missing something? Do I have to specify the global work size to be the number of blocks in all of the packets combined, and the kernel's local work size to be the number of blocks in each packet?
This is the code for my OpenCL kernel:
__kernel void decryptCBC( __global const uchar *rkey, const uint rounds,
                          __global const uchar *prev, __global const uchar *data,
                          __global uchar *result, const uint blocks ) {
    const size_t id = get_global_id( 0 );
    if( id >= blocks ) return;  // valid ids run from 0 to blocks-1, so guard with >=

    const size_t startPos = BlockSize * id;

    // Copy this work item's block into private memory
    uchar block[BlockSize];
    for( uint i = 0; i < BlockSize; i++ ) block[i] = data[startPos + i];

    // Inverse cipher rounds
    AddRoundKey( rkey, block, rounds );
    for( uint j = 1; j < rounds; ++j ) {
        const uint round = rounds - j;
        InverseShiftRows( block );
        InverseSubBytes( block );
        AddRoundKey( rkey, block, round );
        InverseMixColumns( block );
    }
    InverseSubBytes( block );
    InverseShiftRows( block );
    AddRoundKey( rkey, block, 0 );

    // XOR with the previous ciphertext block (CBC) and store the result
    for( uint i = 0; i < BlockSize; i++ ) {
        result[startPos + i] = block[i] ^ prev[startPos + i];
    }
}
With this kernel, I can beat an 8-core CPU with 125 blocks of data in a single packet. To speed up multiple packets, I attempted to combine all of the data into a single buffer. The complication is that each work item then needs to know where to look within the key material, which led to two extra arrays containing the number of rounds and the key offset for each packet. This turned out to be even slower than executing a separate kernel for each packet.
Consider your kernel as a function doing the CBC work. As you've found, its chained nature means the CBC task itself is fundamentally serialized. In addition, a GPU prefers to run groups of 16 threads with identical workloads. That's essentially the size of a single task within a multiprocessor core, of which you tend to have dozens; but the scheduling hardware can only keep a limited number of these tasks in flight, and the memory system can rarely keep up with them. On top of that, loops are one of the worst constructs in a kernel, because GPUs are not designed to do much control flow.
So, looking at AES, it operates on 16-byte blocks, but only with bytewise operations. This will be your first dimension: every block should be worked on by 16 threads (probably the local work size in OpenCL parlance). Make sure to transfer the block to local memory, where all the threads can run in lockstep doing random accesses with very low latency. Unroll everything within an AES block operation, using get_local_id(0) to know which byte each thread operates on. Synchronize with barrier(CLK_LOCAL_MEM_FENCE) in case a workgroup runs on a processor that could fall out of lockstep. The key can probably go into constant memory, as that can be cached. The block chaining might be an appropriate level to have a loop, if only to avoid reloading the previous block's ciphertext from global memory. Asynchronous storing of completed ciphertext with async_work_group_copy() may also help. It is possible you could make each thread do more work by using vectors, but that probably won't help because of steps like ShiftRows.
Basically, if any thread within a group of 16 threads (the size may vary with the architecture) takes a different control-flow path, your GPU is stalling. And if there aren't enough such groups to fill the pipelines and multiprocessors, your GPU is sitting idle. Until you've very carefully optimized the memory accesses it won't come close to CPU speeds, and even after that you'll need dozens of packets to process at once to avoid giving the GPU too little work. The issue then is that although the GPU can run thousands of threads, its control structure only handles a few workgroups at any time.
One other thing to beware of: when you're using barriers, every thread in the workgroup must execute the same barrier calls. That means that even if you have extra threads running idle (for instance, those decrypting a shorter packet within a combined workgroup), they must keep going through the loop even if they make no memory accesses.
It's not entirely clear from your description, but I think there's some conceptual confusion.
Don't loop over the packets and start a new kernel for each one. You don't need to tell OpenCL to start a bunch of kernels. Instead, upload as many packets as you can to the GPU, then run the kernel just once. The global work size you specify is how many instances of the kernel (work items) the GPU will try to run, as many of them in parallel as it can.
You will need to program the kernel so that each work item looks in a different location in the data you uploaded to find its packet. For example, if you were going to add two arrays into a third array, your kernel would look like this:
__kernel void vectorAdd(__global const int* a,
                        __global const int* b,
                        __global int* c) {
    int idx = get_global_id(0);
    c[idx] = a[idx] + b[idx];
}
The important part is that each work item knows where to index into the arrays by using its global id. You'll want to do something similar.
Related
Note: I am using a GT 740, with 2 SMs and 192 CUDA cores per SM.
I have a working CUDA kernel that is executed 4 times:
__global__ void foo(float *d_a, int i) {
    if (i < 1500) {
        ...
        ...
        ...
    }
}

int main() {
    float *d_mem;
    cudaMalloc(&d_mem, lots_of_bytes);
    for (int i = 0; i < 1500; i += 384)
        foo<<<1, 384>>>(d_mem, i);
    return 0;
}
Each kernel call reuses the memory allocated to d_mem because of memory constraints.
I would like to modify it to be executed from a single statement, like this:
foo<<<8,192>>>(d_mem);
I want the two concurrently active thread blocks to access different halves of d_mem; which half each one gets is not important, because no data is shared between blocks.
For example, the following is 1 of several desirable access patterns:
Block 1: d_mem[0] and Block 2: d_mem[1]
Block 3: d_mem[0] and Block 4: d_mem[1]
...
While this is undesirable:
Block 1: d_mem[0] and Block 2: d_mem[0]
Block 3: d_mem[1] and Block 4: d_mem[1]
...
Essentially, I want a way to address d_mem so that any combination of active blocks access different parts of it.
I thought that addressing d_mem with a block's SM ID might work, but it appears that this ID is not guaranteed to remain the same throughout a block's life.
I also considered addressing d_mem with a thread's global ID modulo 2 (threadIdx.x + blockIdx.x * blockDim.x) % 2, but this relies on the blocks being processed in a particular order.
This is mainly relevant to the use of 1 block per SM, but I am also interested in how this could be solved for an arbitrary number of blocks per SM, if possible at all.
The simplest way would be
foo<<<2, 384>>> or foo<<<2, 192>>>
and putting a for loop around the calculations in your kernel. You can then select the half of memory with blockIdx.x; even if the two blocks are scheduled on the same SM, it works. This method also works with more than one block per SM, e.g. (for a quarter of the memory per block)
foo<<<4, 96>>>
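A minimal sketch of that looped kernel, with the per-item calculation left elided as in the original; half_elems is an assumed parameter giving the number of elements in each half of d_mem:

__global__ void foo(float *d_mem, int half_elems) {
    // Each block works in its own half of the allocation, no matter which SM runs it.
    float *my_half = d_mem + blockIdx.x * half_elems;

    // Grid-stride loop over the range that the four original launches covered.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < 1500;
         idx += blockDim.x * gridDim.x) {
        // ... original calculations for work item idx, using my_half as scratch ...
    }
}

// launched as foo<<<2, 384>>>(d_mem, half_elems);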
Having only 192 threads per SM is inefficient (better at least 384, better still 512, 576, 768, 960 or 1024): the SM needs enough active threads to switch between in order to hide latencies. If you run into memory problems from having more than 384 (= 2 SMs * 192) calculations active at the same time, consider whether you could use more than one thread per work package (per value of threadIdx.x + i); threads can easily cooperate within warp boundaries or through shared memory. Sometimes it is beneficial to use the extra threads only for the parts of the kernel that read and write global memory, since that is where the largest latencies occur.
So you could call your kernel as
foo<<<2, dim3(4, 192)>>>
and have 4 threads per work package instead of one. In graphics, those 4 could be the RGBA channels, or XYZ coordinates, or triangle corners. Their role can also change throughout the kernel.
As a performance optimization this makes some calculations more complicated.
Your if-statement for the current implementation with one block probably should be
if(threadIdx.x + i < 1500)
Is it possible to use an OpenCL data-parallel kernel to sum a vector of size N without doing the partial-sum trick?
Say you have access to 16 work items and your vector is of size 16. Wouldn't it be possible to just have a kernel do the following:
__kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    sum[0] += input[idx];
}
When I try this, the sum variable doesn't get accumulated, it only gets overwritten. I've read something about using barriers, and I tried inserting a barrier before the summation above; it does update the variable somehow, but it doesn't reproduce the correct sum.
Let me try to explain why sum[0] is overwritten rather than updated.
In your case of 16 work items, there are 16 threads which are running simultaneously. Now sum[0] is a single memory location which is shared by all of the threads, and the line sum[0] += input[idx] is run by each of the 16 threads, simultaneously.
The statement sum[0] += input[idx] (I think) expands to a read of sum[0], then an addition of input[idx] to that value, before writing the result back to sum[0].
There will be a data race, as multiple threads are reading from and writing to the same shared memory location. So what might happen is:
All threads may read the value of sum[0] before any other thread writes its updated result back to sum[0], in which case the final value of sum[0] would be the input[idx] of whichever thread executed slowest. Since this will differ from run to run, you should see different results if you run the example multiple times.

Or, one thread may execute slightly more slowly, in which case another thread may already have written an updated result back to sum[0] before the slow thread reads sum[0]; the addition then incorporates the values of more than one thread, but not all of them.
So how can you avoid this?
Option 1 - Atomics (Worse Option):
You can use atomics to force all threads to block if another thread is performing an operation on the shared memory location, but this obviously results in a loss of performance since you are making the parallel process serial (and incurring the costs of parallelisation -- such as moving memory between the host and the device and creating the threads).
Option 2 - Reduction (Better Option):
The best solution would be to reduce the array, since this uses the parallelism most effectively and gives O(log(N)) performance. Here is a good overview of reduction using OpenCL: Reduction Example.
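For illustration, here is a minimal sketch of the tree-reduction idea. It is written as a CUDA kernel (the linked example covers the OpenCL version, which has the same structure using __local memory and barrier()), and it assumes a power-of-two block size and a device that supports atomicAdd on float:

__global__ void reduce_sum(const float *input, float *sum, int n) {
    extern __shared__ float scratch[];            // one float per thread in the block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    scratch[tid] = (i < n) ? input[i] : 0.0f;     // load one element per thread
    __syncthreads();

    // Halve the number of active threads each step: O(log(blockDim.x)) steps.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) scratch[tid] += scratch[tid + stride];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(sum, scratch[0]);     // one atomic per block, not per element
}

// launched as reduce_sum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_sum, n);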
Option 3 (and worst of all)
__kernel void summation(__global float* input, __global float* sum)
{
    int idx = get_global_id(0);
    for(int j = 0; j < N; j++)
    {
        barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
        if(idx == j)
            sum[0] += input[idx];
        else
            doOtherWorkWhileSingleCoreSums();
    }
}
On a mainstream GPU, this would sum the elements about as slowly as a Pentium MMX. It is just like computing on a single core and giving the other cores other jobs, only slower.
A CPU device could be better than a GPU for this kind of work.
I'm writing a CUDA-based program that needs to periodically transfer a set of items from the GPU to host memory. In order to keep the process asynchronous, I was hoping to use CUDA's UMA to have a memory buffer and flag in host memory (so both the GPU and the CPU can access it). The GPU would make sure the flag is clear, add its items to the buffer, and set the flag. The CPU would wait for the flag to be set, copy things out of the buffer, and clear the flag. As far as I can see, this doesn't produce any race condition because it forces the GPU and CPU to take turns, always reading and writing the flag opposite each other.
So far I haven't been able to get this to work because there does seem to be some sort of race condition. I came up with a simpler example that has a similar issue:
#include <stdio.h>

__global__ void uva_counting_test(int n, int *h_i);

int main() {
    int *h_i;
    int n;

    cudaMallocHost(&h_i, sizeof(int));
    *h_i = 0;
    n = 2;

    uva_counting_test<<<1, 1>>>(n, h_i);

    // even numbers
    for(int i = 1; i <= n; ++i) {
        // wait for a change to odd from the GPU
        while(*h_i == (2*(i - 1)));

        printf("host h_i: %d\n", *h_i);
        *h_i = 2*i;
    }

    return 0;
}

__global__ void uva_counting_test(int n, int *h_i) {
    // odd numbers
    for(int i = 0; i < n; ++i) {
        // wait for a change to even from the host
        while(*h_i == (2*(i - 1) + 1));

        *h_i = 2*i + 1;
    }
}
For me, this case always hangs after the first print statement from the CPU (host h_i: 1). The really unusual thing (which may be a clue) is that I can get it to work in cuda-gdb. If I run it in cuda-gdb, it will hang as before. If I press ctrl+C, it will bring me to the while() loop line in the kernel. From there, surprisingly, I can tell it to continue and it will finish. For n > 2, it will freeze on the while() loop in the kernel again after each kernel, but I can keep pushing it forward with ctrl+C and continue.
If there's a better way to accomplish what I'm trying to do, that would also be helpful.
You are describing a producer-consumer model, where the GPU is producing some data and from time-to-time the CPU will consume that data.
The simplest way to implement this is to have the CPU be the master. The CPU launches a kernel on the GPU and, when it is ready to consume data (i.e. the while loop in your example), it synchronises with the GPU, copies the data back, launches the kernel again to generate more data, and does whatever it has to do with the data it copied. This allows you to have the GPU filling a fixed-size buffer while the CPU is processing the previous batch (since there are two copies, one on the GPU and one on the CPU).
That can be improved upon by double-buffering the data, meaning that you can keep the GPU busy producing data 100% of the time by ping-ponging between buffers as you copy the other to the CPU. That assumes the copy-back is faster than the production, but if not then you will saturate the copy bandwidth which is also good.
Neither of those are what you actually described. What you asked for is to have the GPU master the data. I'd urge caution on that since you will need to manage your buffer size carefully and you will need to think carefully about the timings and communication issues. It's certainly possible to do something like that but before you explore that direction you should read up about memory fences, atomic operations, and volatile.
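To make the double-buffered, CPU-master scheme above concrete, here is a minimal sketch using two streams and pinned host buffers, so the GPU can produce one batch while the CPU consumes the previous one. The produce kernel, batch size and batch count are placeholder assumptions:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void produce(float *buf, int n, int batch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = batch * n + i;               // stand-in for real data production
}

int main() {
    const int n = 1 << 20, batches = 8;
    float *d_buf[2], *h_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], n * sizeof(float));
        cudaMallocHost(&h_buf[s], n * sizeof(float)); // pinned memory for async copies
        cudaStreamCreate(&stream[s]);
    }

    for (int b = 0; b < batches; ++b) {
        int s = b & 1;                               // ping-pong between the two buffers
        produce<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n, b);
        cudaMemcpyAsync(h_buf[s], d_buf[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
        if (b > 0) {                                 // consume the previous batch meanwhile
            int p = (b - 1) & 1;
            cudaStreamSynchronize(stream[p]);
            printf("batch %d, first element %.0f\n", b - 1, h_buf[p][0]);
        }
    }
    int last = (batches - 1) & 1;
    cudaStreamSynchronize(stream[last]);
    printf("batch %d, first element %.0f\n", batches - 1, h_buf[last][0]);

    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
        cudaFreeHost(h_buf[s]);
    }
    return 0;
}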
I'd try to add
__threadfence_system();
after
*h_i = 2*i + 1;
See here for details. Without it, it's entirely possible that the modification stays in the GPU cache forever. That said, you are better off following the other answers: to extend this to multiple threads/blocks you have to deal with other "problems" before a scheme like this works reliably.
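As a sketch of that suggestion applied to the kernel in the question, also marking the pointer volatile so the compiler does not cache the reads in the spin loop (assuming a compute capability 2.0+ device; this only illustrates where the fence goes, not that the spin-wait scheme is robust):

__global__ void uva_counting_test(int n, volatile int *h_i) {
    // odd numbers
    for (int i = 0; i < n; ++i) {
        // wait for a change to even from the host; volatile forces a fresh read each time
        while (*h_i == (2 * (i - 1) + 1));

        *h_i = 2 * i + 1;
        __threadfence_system();   // make the store visible to the host
    }
}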
As Tom suggested (+1), it is better to use double buffering. Streams help a lot with such a scheme, as you can find depicted here.
We all know that, in terms of hardware architecture, a GPGPU has several streaming multiprocessors (SMs), each with many streaming processors (SPs). But NVIDIA's CUDA programming model introduces other concepts: block and thread.
We also know that block corresponds to SM and thread corresponds to SP. When we launch a CUDA kernel, we configure it as kernel<<<blockNum, threadsNum>>>. I have been writing CUDA programs like this for nearly two months, but I still have a lot of questions. A good programmer is never satisfied with a merely bug-free program; they want to delve inside and know how the program behaves.
Here are my questions.
Suppose a GPU has 14 SMs, each with 48 SPs, and we have a kernel like this:
__global__ void doubleData(int *data, int dataNum){
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while(tid < dataNum){
        data[tid] *= 2;
        tid += blockDim.x * gridDim.x;   // grid-stride step
    }
}
and data is an array of 1024 * 1024 ints, with the kernel configured as <<<128, 512>>>. That means the grid has 512 * 128 threads and every thread will iterate (1024 * 1024) / (512 * 128) = 16 times in its while loop.
1. There are only 14 * 48 SPs, which means that only 14 * 48 threads can run simultaneously, no matter how many blocks or threads are in the configuration. So what is the point of blockNum and threadNum in the configuration? Why not just launch <<<number of SMs, number of SPs>>>?
2. Is there any difference between <<<128, 512>>> and <<<64, 512>>>? Perhaps the former iterates 16 times in its while loop and the latter 32 times, but the former has twice as many blocks to schedule. Is there any way to know the best configuration, other than just comparing results and picking the best? We cannot try every combination, so the result would not be the true optimum, only the best among our attempts.
3. We know only one block can run on an SM at a time, but where does CUDA store the other blocks' contexts? Suppose there are 512 blocks and 14 SMs; only 14 blocks have their contexts on SMs, so what about the other 498 blocks' contexts?
"We also know that block corresponds to SM and thread corresponds to SP."
This is incorrect. An SM can process multiple blocks simultaneously and an SP can process multiple threads simultaneously.
1) I think your question may come from not separating the work that an application needs to have done from the resources available to do that work. When you launch a kernel, you specify the work you want done. The GPU then uses its resources to perform that work. The more resources a GPU has, the more work it can do in parallel; any work that cannot be done in parallel is done serially.
By letting the programmer specify the work that needs to be done without tying it to the amount of resources available on a given GPU, CUDA provides an abstraction that allows the app to seamlessly scale to any GPU.
"There are only 14 * 48 SPs, which means that only 14 * 48 threads can run simultaneously, no matter how many blocks or threads are in the configuration."
SPs are pipelined, so they process many threads simultaneously. Each thread is in a different stage of completion. Each SP can start one operation and yield the result of one operation per clock. Though, as you can see now, even if your statement was correct, it wouldn't lead to your conclusion.
2) Threads in a block can cooperate with each other using shared memory. If your app is not using shared memory, the only implication of block size is performance. Initially, you can get a good value for the block size by using the occupancy calculator. After that, you can further fine tune your block size for performance by testing with different sizes. Since threads are run in groups of 32, called warps, you want to have the block size be divisible by 32. So there are not that many block sizes to test.
3) An SM can run a number of blocks at the same time. The number of blocks depends on how many resources each block requires and how many resources the SM has. A block uses a number of different resources and one of the resources becomes the limiting factor in how many blocks will run simultaneously. The occupancy calculator will tell you what the limiting factor is.
Only blocks that run simultaneously consume resources on an SM. I think those resources are what you mean by context. Blocks that are completed and blocks that are not yet started do not consume resources on an SM.
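To make the occupancy-calculator point concrete, the same numbers can also be queried programmatically. A minimal sketch, assuming CUDA 6.5 or later for the occupancy API and reusing the question's kernel as a stand-in:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleData(int *data, int dataNum) {
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < dataNum) {
        data[tid] *= 2;
        tid += blockDim.x * gridDim.x;
    }
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Suggests a block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, doubleData, 0, 0);

    int blocksPerSM = 0;
    // How many blocks of that size can be resident on one SM at once.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, doubleData, blockSize, 0);

    printf("suggested block size %d, minimum grid size %d, resident blocks per SM %d\n",
           blockSize, minGridSize, blocksPerSM);
    return 0;
}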
I started my adventure with CUDA today. I'm trying to share an unsigned int among all the threads; all of the threads modify this value. I copied the value to the device using cudaMemcpy, but at the end, when the calculations were finished, the value came back as 0.
Maybe several threads are writing to this variable at the same time?
I'm not sure whether I should use a semaphore, or lock this variable when a thread starts writing, or something else.
EDIT:
It's hard to give more detail, because my question is really about how to solve this in general. I'm not implementing any particular algorithm, only testing CUDA.
But if you wish: I created a vector containing some values (unsigned int). I try to find values bigger than a given shared value; when a value from the vector is bigger, I add 1 to it and store the result in the shared value.
It looks like this:
__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)
        *b = a[idx] + 1;
}
As I said, it's not useful code, only a test, but I wonder how to do it...
"My question is in general how to use shared memory global for every threads."
For reading, you don't need anything special. What you did works: it is faster on Fermi devices because they have a cache, and slower on the others.
If you are reading the value after other threads have changed it, you have no way to wait for all threads to finish their operations before reading the value you want, so it might not be what you expect. The only way to synchronize a value in global memory between all running threads is to use separate kernel launches: after you change a value you want to share between all threads, the kernel finishes and you launch a new one that works with the shared value.
To make every thread write to the same memory location you must use atomic operations, but keep in mind that you should keep atomic operations to a minimum, as they effectively serialize the execution.
For the available atomic functions, see section B.11 of the CUDA C Programming Guide, available here.
What you asked for would be something like:
__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)
        //*b = a[idx]+1;
        atomicMax(b, a[idx] + 1);   // atomically keep the largest a[idx] + 1 seen so far
}
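If the atomic traffic itself became a bottleneck, one way to keep atomic operations to a minimum (as advised above) would be to reduce within each block first and issue a single atomic per block. A hedged sketch, assuming a power-of-two block size of 256:

__global__ void methodReduced(const unsigned int *a, unsigned int *b,
                              long long unsigned N) {
    __shared__ unsigned int best[256];                       // must match blockDim.x
    int tid = threadIdx.x;
    long long unsigned idx = (long long unsigned)blockIdx.x * blockDim.x + threadIdx.x;

    best[tid] = (idx < N) ? a[idx] + 1 : 0;                  // each thread's candidate
    __syncthreads();

    // Tree reduction in shared memory to find the block's largest candidate.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && best[tid + stride] > best[tid])
            best[tid] = best[tid + stride];
        __syncthreads();
    }

    if (tid == 0) atomicMax(b, best[0]);                     // one atomic per block
}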
If the value is in shared memory it will only be visible to the threads that run on a single multiprocessor (i.e. within one thread block), and NOT to every thread launched for that kernel. You will definitely need to use atomic operations (such as atomicAdd, etc.) if you expect several threads to write to the variable simultaneously.
Be aware though that this will serialize all simultaneous thread requests for writing to the variable.
Ideally, though, you don't want to do this at all, unless you can be sure all of the threads will take about the same time. See the Cuda thread tutorial.