At what code complexity does an OpenACC kernel lose efficiency on common GPUs? - c++

At roughly what code complexity do OpenACC kernels lose efficiency on common GPUs, i.e. at what point do registers, shared memory operations, or some other resource start to bottleneck performance?
Also, is there a point where the work per kernel is so small that the overhead of transferring data and launching work on the GPU cores becomes the bottleneck?
Would cache sizes, and whether the working set fits in them, indicate the optimal amount of work per kernel, or is it something else?
Roughly how large is the OpenACC overhead per kernel compared to the potential performance, and does it vary much between directives?

I would refrain from using the complexity of the code as an indication of performance. You can have a highly complex code run very efficiently on a GPU and a simple code run poorly. Instead, I would look at the following factors:
Data movement between the device and host. Limit the frequency of data movement and try to transfer data in contiguous chunks. Use OpenACC unstructured data regions to match the host allocation on the device (i.e. use "enter data" at the same time as you allocate data via "new" or "malloc"). Move as much compute to the GPU as you can and only use the OpenACC update directive to synchronize host and device data when absolutely necessary. In cases where data movement is unavoidable, investigate using the "async" clause to interleave the data movement with compute.
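For example, a minimal sketch (the variable names here are made up) of pairing an unstructured data region with the host allocation:
// Hypothetical sketch: pair "enter data" with the allocation and "exit data" with the deallocation.
double *x = new double[n];                 // host allocation
#pragma acc enter data create(x[0:n])      // matching device allocation
#pragma acc parallel loop present(x[0:n])  // compute on the device, no implicit copies
for (int i = 0; i < n; ++i)
    x[i] = 0.0;
#pragma acc update self(x[0:n])            // synchronize to the host only when truly needed
#pragma acc exit data delete(x[0:n])       // device deallocation paired with delete
delete[] x;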
Data access on the device and limiting memory divergence. Arrange your data layout so that the stride-1 (contiguous) dimension of your arrays is accessed contiguously across the vector lanes.
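As an illustrative sketch (array and size names are made up), for a row-major C/C++ array this means letting the vector loop run over the contiguous index:
// Hypothetical example: "a" is stored row-major, so j is the stride-1 index.
#pragma acc parallel loop gang
for (int i = 0; i < rows; ++i) {
    #pragma acc loop vector
    for (int j = 0; j < cols; ++j)         // contiguous (stride-1) accesses across the vector lanes
        a[i*cols + j] *= 2.0;
}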
Have a high compute intensity, which is the ratio of computation to data movement: the more compute and the less data movement, the better. However, lower compute intensity loops are fine if there are other high intensity loops and the cost of moving the data to the host would outweigh the cost of running the kernel on the device.
Avoid allocating data on the device since it forces threads to serialize. This includes Fortran "automatic" arrays and C++ objects whose constructors perform allocation.
Avoid atomic operations. Atomic operations are actually quite efficient when compared to host atomics, but still should be avoided if possible.
Avoid subroutine calls. Try to inline routines when possible.
Occupancy. Occupancy is the ratio of the number of threads that can potentially be running on the GPU to the maximum number of threads that could be running. Note that 100% occupancy does not guarantee high performance, but you should try to get above 50% if possible. The limiters to occupancy are the number of registers used per thread (vector) and the shared memory used per block (gang). Assuming you're using the PGI compiler, you can see the limits of your device by running the PGI "pgaccelinfo" utility. The number of registers used will depend upon the number of local scalars (explicitly declared by the programmer plus temporaries created by the compiler to hold intermediate calculations), and the amount of shared memory used will be determined by the OpenACC "cache" directive and by "private" when it is used on a "gang" loop. You can see how much each kernel uses by adding the flag "-ta=tesla:ptxinfo". You can limit the number of registers used per thread via "-ta=tesla:maxregcount:n". Reducing the number of registers will increase the occupancy but also increase the number of register spills. Spills are fine so long as they only spill to L1/L2 cache; spilling to global memory will hurt performance. It's often better to suffer lower occupancy than to spill to global memory.
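As a rough illustration (the loop and array names are made up) of where shared memory comes from in OpenACC, the "cache" directive asks the compiler to stage a small window of an array in shared memory for each gang, which counts against the shared-memory-per-block limit reported by ptxinfo:
// Hypothetical sketch of the OpenACC cache directive on a small stencil.
#pragma acc parallel loop gang vector
for (int i = 1; i < n - 1; ++i) {
    #pragma acc cache(a[i-1:3])            // cache a 3-element window in shared memory
    b[i] = 0.25 * (a[i-1] + 2.0 * a[i] + a[i+1]);
}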
Note that I highly recommend using a profiler (PGPROF, NVprof, Score-P, TAU, Vampir, etc.) to help discover a program's performance bottlenecks.

Related

Share large constant data among cuda threads

I have a kernel which is called multiple times. In each call, constant data of around 240 KB is shared and processed by the threads. The threads work independently, like a map function. The stalling time of the threads is considerable. The reason behind that may be bank conflicts on the memory reads. How can I handle this? (I have a GTX 1080 Ti.)
Can "const global" in OpenCL handle this? (Constant memory in CUDA is limited to 64 KB.)
In CUDA, I believe the best recommendation would be to make use of the so-called "read-only" cache. This has at least two possible benefits over the __constant__ memory/constant cache system:
It is not limited to 64kB like __constant__ memory is.
It does not expect or require "uniform access" like the constant cache does, to deliver full access bandwidth/best performance. Uniform access refers to the idea that all threads in a warp are accessing the same location or same constant memory value (per read cycle/instruction).
The read-only cache is documented in the CUDA programming guide. Possibly the easiest way to use it is to decorate your pointers passed to the CUDA kernel with __restrict__ (assuming you are not aliasing between pointers) and to decorate the pointer that refers to the large constant data with const ... __restrict__. This will allow the compiler to generate appropriate LDG instructions for access to constant data, pulling it through the read-only cache mechanism.
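A minimal sketch (the kernel and parameter names are made up) of the decoration described above:
// Hypothetical kernel: "table" is the large read-only constant data.
// const ... __restrict__ lets the compiler emit LDG loads through the read-only cache.
__global__ void mapKernel(const float* __restrict__ table,
                          float* __restrict__ out,
                          int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = 2.0f * table[idx % 1024];   // arbitrary illustrative use of the table
}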
This read-only cache mechanism is only supported on cc 3.5 or higher GPUs, but that covers some GPUs in the Kepler generation and all GPUs in the Maxwell, Pascal (including your GTX 1080 ti), Volta, and Turing generations.
If you have a GPU that is less than cc 3.5, possibly the best suggestion for similar benefits (larger than __constant__, not needing uniform access) would be to use texture memory. This is also documented elsewhere in the programming guide; there are various CUDA sample codes that demonstrate the use of texture memory, and plenty of questions here on the SO cuda tag cover it as well.
Constant memory that doesn't fit in the hardware's constant buffer will typically "spill" into global memory on OpenCL. Bank conflicts are usually an issue with local memory, however, so that's probably not it. I'm assuming CUDA's 64kiB constant limit reflects nvidia's hardware, so OpenCL isn't going to magically perform better here.
Reading global memory without a predictable pattern can of course be slow, however, especially if you don't have sufficient thread occupancy and arithmetic to mask the memory latency.
Without knowing anything further about your problem space, this also brings me to the directions you could take for further optimisation, assuming your global memory reads are the issue:
Reduce the amount of constant/global data you need, for example by using more efficient types, other compression mechanisms, or computing some of the values on the fly (possibly storing them in local memory for all threads in a group to share).
Cluster the most frequently used data in a small constant buffer, and explicitly place the more rarely used constants in a global buffer. This may help the runtime lay it out more efficiently in the hardware. If that doesn't help, try to copy the frequently used data into local memory, and make your thread groups large and comparatively long-running to hide the copying hit.
Check if thread occupancy could be improved. It usually can, and this tends to give you substantial performance improvements in almost any situation. (except if your code is already extremely ALU/FPU bound)

Running a single block with multiple threads, CUDA

I know that you should generally have at least 32 threads running per block on CUDA since threads are executed in groups of 32. However, I was wondering if it is considered acceptable practice to have only one block with a bunch of threads (I know there is a limit on the number of threads). I am asking this because I have some problems which require shared memory between threads and synchronization across every element of the computation. I want to launch my kernel like
computeSomething<<< 1, 256 >>>(...)
and just use the threads to do the computation.
Is it efficient to just have one block, or would I be better off doing the computation on the CPU?
If you care about performance, it's a bad idea.
The principal reason is that a given threadblock can only occupy the resources of a single SM on a GPU. Since most GPUs have 2 or more SMs, this means you're leaving somewhere between 50% and over 90% of the GPU performance untouched.
For performance, both of these kernel configurations are bad:
kernel<<<1, N>>>(...);
and
kernel<<<N, 1>>>(...);
The first is the case you're asking about. The second is the case of a single thread per threadblock; this leaves about 97% of the GPU horsepower untouched.
In addition to the above considerations, GPUs are latency-hiding machines and like to have a lot of threads, warps, and threadblocks available to select work from. Having lots of available threads helps the GPU hide latency, which generally results in higher efficiency (work accomplished per unit time).
It's impossible to tell if it would be faster on the CPU. You would have to benchmark and compare. If all of the data is already on the GPU, and you would have to move it back to the CPU to do the work, and then move the results back to the GPU, then it might still be faster to use the GPU in a relatively inefficient way, in order to avoid the overhead of moving data around.
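For example, a common pattern (a sketch, not your actual code; note that it does not give you synchronization across all elements the way a single block does, and a device-wide barrier is usually done by splitting the work into separate kernel launches) is to size the grid to the problem and use a grid-stride loop:
// Hypothetical sketch of a multi-block launch with a grid-stride loop.
__global__ void computeSomething(float* data, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)      // grid-stride loop covers any n
        data[i] = data[i] * data[i];       // placeholder work
}

int threads = 256;
int blocks  = (n + threads - 1) / threads; // enough blocks to span the problem and keep all SMs busy
computeSomething<<<blocks, threads>>>(d_data, n);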

Cuda zero-copy performance

Does anyone have experience with analyzing the performance of CUDA applications utilizing the zero-copy (reference here: Default Pinned Memory Vs Zero-Copy Memory) memory model?
I have a kernel that uses the zero-copy feature and with NVVP I see the following:
Running the kernel on an average problem size I get instruction replay overhead of 0.7%, so nothing major. And all of this 0.7% is global memory replay overhead.
When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead.
However, the global load efficiency and global store efficiency for both the normal problem size kernel run and the very very large problem size kernel run are the same. I'm not really sure what to make of this combination of metrics.
The main thing I'm not sure of is which statistics in NVVP will help me see what is going on with the zero copy feature. Any ideas of what type of statistics I should be looking at?
Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:
The memory operation was for a size specifier (vector type) that requires multiple transactions in order to perform the address divergence calculation and communicate data to/from the L1 cache.
The memory operation had thread address divergence requiring access to multiple cache lines.
The memory transaction missed the L1 cache. When the miss value is returned to L1 the L1 notifies the warp scheduler to replay the instruction.
The LSU resources are full and the instruction needs to be replayed when the resources are available.
The approximate latencies are:
L2 cache: 200-400 cycles
device memory (DRAM): 400-800 cycles
zero-copy memory over PCIe: thousands of cycles
The replay overhead increases because of the rise in misses and the contention for LSU resources caused by the higher latency.
The global load efficiency is not increasing as it is the ratio of the ideal amount of data that would need to be transferred for the memory instructions that were executed to the actual amount of data transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache line boundary (32-bit operation is 1 cache line, 64-bit operation is 2 cache lines, 128-bit operation is 4 cache lines). Accessing zero copy is slower and less efficient but it does not increase or change the amount of data transferred.
The profiler exposes the following counters:
gld_throughput
l1_cache_global_hit_rate
dram_{read, write}_throughput
l2_l1_read_hit_rate
In the zero copy case all of these metrics should be much lower.
The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero copy memory).
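For reference, a typical zero-copy setup (a sketch, assuming the mapped memory is being allocated in roughly this way) looks like:
// Hypothetical sketch of zero-copy (mapped pinned) memory setup.
cudaSetDeviceFlags(cudaDeviceMapHost);                    // must be set before the context is created
float *h_buf = NULL, *d_buf = NULL;
cudaHostAlloc((void**)&h_buf, n * sizeof(float), cudaHostAllocMapped);  // pinned, mapped host memory
cudaHostGetDevicePointer((void**)&d_buf, h_buf, 0);       // device-visible alias of the same memory
kernel<<<blocks, threads>>>(d_buf, n);                    // each access crosses PCIe (1000s of cycles)
cudaDeviceSynchronize();
cudaFreeHost(h_buf);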

Using Nsight to determine bank conflicts and coalescing

How can I know the number of non-coalesced reads/writes and bank conflicts using Parallel Nsight?
Also, what should I look at when I use Nsight as a profiler? What are the important fields that may show what is slowing my program down?
I don't use NSight, but typical fields that you'll look at with a profiler are basically:
memory consumption
time spent in functions
More specifically, with CUDA, you'll want to pay attention to your GPU's occupancy.
Other interesting values are the way the compiler has set your local variables: in registers or in local memory.
Finally, you'll check the time spent transferring data to and from the GPU, and compare it with the computation time.
For bank conflicts, you need to watch warp serialization. See here.
And here is a discussion about monitoring memory coalescing: basically, you just need to watch Global Memory Loads/Stores - Coalesced/Uncoalesced and flag the uncoalesced ones.
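To make the two issues concrete, here is an illustrative (made-up) kernel showing an access pattern the profiler would flag for each, with a fixed version in the comments:
// Hypothetical kernel, launched with 32 threads per block and "in" large enough.
__global__ void badPatterns(const float* in, float* out)
{
    __shared__ float tile[32][32];
    int tid = threadIdx.x;

    // Uncoalesced global load: consecutive threads read addresses 128 bytes apart,
    // so one warp needs many memory transactions (flagged as uncoalesced loads).
    float v = in[tid * 32];
    // Coalesced alternative: float v = in[blockIdx.x * blockDim.x + tid];

    // Shared memory bank conflict: a stride of 32 floats puts every thread of the
    // warp in the same bank, which shows up as warp serialization.
    tile[tid][0] = v;
    // Conflict-free alternative: tile[0][tid] = v;

    __syncthreads();
    out[tid] = tile[tid][0];
}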
M. Tibbits basically answered what you need to know for bank conflicts and non-coalesced memory transactions.
As for the question of what important fields/things to look at (when using the Nsight profiler) that may show what is slowing your program down:
Use Application or System Trace to determine if you are CPU bound, memory bound, or kernel bound. This can be done by looking at the Timeline.
a. CPU bound – you will see large areas where no kernel or memory copy is occurring but your application threads (Thread State) are green
b. Memory bound – kernel execution is blocked on memory transfers to or from the device. You can see this by looking at the Memory row. If you are spending a lot of time in memory copies then you should consider using CUDA streams to pipeline your application. This allows you to overlap memory transfers and kernels. Before changing your code you should compare the duration of the transfers and the kernels and make sure you will get a performance gain.
c. Kernel bound – If the majority of the application time is spent waiting on kernels to complete then you should switch to the "Profile" activity, re-run your application, and start collecting hardware counters to see how you can make your kernel's actual execution time faster.

CUDA - operations on single elements of a matrix - getting ideas

I'm about to write a CUDA kernel to perform a single operation on every element of a matrix (e.g. square-rooting every element, or exponentiation, or calculating the sine/cosine if all the numbers are in [-1;1], etc.).
I chose the blocks/threads grid dimensions and I think the code is pretty straightforward and simple, but I'm asking myself... what can I do to maximize coalescence/SM occupancy?
My first idea was: make each half-warp (16 threads) load a data ensemble from global memory and then put them all to compute, but it turns out that there is not enough overlap of memory transfers and calculations. I mean, all the threads load data, then compute, then load data again, then calculate again... this sounds really poor in terms of performance.
I thought using shared memory would be great, maybe using some sort of locality to make a thread load more data than it actually needs to facilitate other threads' work, but this sounds stupid too because the latter would wait for the former to finish loading data before starting its work.
I'm not really sure I have described my problem well; I'm just gathering ideas before starting to work on something concrete.
Every comment/suggestion/criticism is welcome, and thanks.
If you have defined the grid so that threads read along the major dimension of the array containing your matrix, then you have already guaranteed coalesced memory access, and there is little else to be done to improve performance. These sorts of O(N) complexity operations really do not contain sufficient arithmetic intensity to give good parallel speed-up over an optimized CPU implementation. Often the best strategy is to fuse multiple O(N) operations together into a single kernel to improve the FLOP-to-memory-transaction ratio.
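For instance, rather than launching one kernel per elementwise operation, a fused sketch (with made-up operations) might be:
// Hypothetical fused kernel: one pass over memory performs several O(N) operations,
// improving the FLOP-to-byte ratio compared to three separate kernels.
__global__ void fusedElementwise(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];        // single global load
        x = sqrtf(x);
        x = expf(-x);
        x = sinf(x);
        a[i] = x;              // single global store
    }
}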
In my eyes your problem is this
load data ensemble from global memory
It seems that your algorithm idea is:
1. Do something on the CPU - have some matrix
2. Transfer the matrix from host to device memory
3. Perform your operation on every element
4. Transfer the matrix back from device to host memory
5. Do something else on the CPU - sometimes go back to 1.
This kind of computation is almost always I/O-bandwidth limited (IO = memory IO), not compute limited. GPGPU computation can sustain a very high memory bandwidth - but only from device memory to the GPU - transfers from host memory always go over the very slow PCIe bus (slow compared to the device memory connection, which can deliver upwards of 160 GB/s on fast cards). So one main thing to get good results is to keep the data (matrix) in device memory - preferably generate it there if possible (depending on your problem). Never try to migrate data back and forth between CPU and GPU, as the transfer overhead eats up all your speedup. Also keep in mind that your matrix must have a certain size to amortize the transfer overhead, which you can't avoid (computing a matrix with 10 x 10 elements would bring almost nothing, heck it would even cost more).
The interleaved transfer/compute/transfer is fully OK, that's how such GPU algorithms work - but only if the transfer is from device memory.
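A sketch of keeping the data resident on the device across iterations (the names here are made up):
// Hypothetical sketch: copy once, iterate on the device, copy back once.
float *d_mat;
cudaMalloc((void**)&d_mat, n * sizeof(float));
cudaMemcpy(d_mat, h_mat, n * sizeof(float), cudaMemcpyHostToDevice);   // one transfer in
for (int step = 0; step < numSteps; ++step)
    elementwiseOp<<<blocks, threads>>>(d_mat, n);   // data stays in device memory between kernels
cudaMemcpy(h_mat, d_mat, n * sizeof(float), cudaMemcpyDeviceToHost);   // one transfer out
cudaFree(d_mat);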
The GPU for something this trivial is overkill and will be slower than just keeping it on the CPU. Especially if you have a multicore CPU.
I have seen many projects showing the "great" advantages of the GPU over the CPU. They rarely stand up to scrutiny. Of course, goofy managers who want to impress their own managers want to show how "leading edge" their group is.
Someone in the department toils for months on getting silly GPU code optimized (which is generally far harder to read than equivalent CPU code), then has the "equivalent" CPU code knocked out carelessly by whoever is available, compiles it with the slowest version of gcc they can find, with no optimization, and then touts a 2x speed improvement. And by the way, many overlook I/O speed as somehow not important.