I have a kernel which is called multiple times. In each call, constant data of around 240 KB is shared and processed by the threads. The threads work independently, like a map function. The stall time of the threads is considerable, and the reason might be bank conflicts in the memory reads. How can I handle this? (I have a GTX 1080 Ti.)
Can "const global" of opencl handle this? (because constant memory in cuda is limited to 64 kb)
In CUDA, I believe the best recommendation would be to make use of the so-called "read-only" cache. This has at least two possible benefits over the __constant__ memory/constant cache system:
It is not limited to 64kB like __constant__ memory is.
It does not expect or require "uniform access" like the constant cache does, to deliver full access bandwidth/best performance. Uniform access refers to the idea that all threads in a warp are accessing the same location or same constant memory value (per read cycle/instruction).
The read-only cache is documented in the CUDA programming guide. Possibly the easiest way to use it is to decorate your pointers passed to the CUDA kernel with __restrict__ (assuming you are not aliasing between pointers) and to decorate the pointer that refers to the large constant data with const ... __restrict__. This will allow the compiler to generate appropriate LDG instructions for access to constant data, pulling it through the read-only cache mechanism.
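As a rough sketch (the kernel and parameter names here are hypothetical, not taken from the question), such a kernel might look like this:

    // Each thread maps one input element to one output element while reading
    // from a large, read-only lookup table (~240 KB). Marking the table
    // pointer "const ... __restrict__" lets the compiler emit LDG loads that
    // go through the read-only cache.
    __global__ void mapKernel(const float* __restrict__ table,
                              const float* __restrict__ in,
                              float* __restrict__ out,
                              int n, int tableSize)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Placeholder access pattern; index the table however your problem requires.
            out[i] = in[i] * table[i % tableSize];
        }
    }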
This read-only cache mechanism is only supported on GPUs of cc 3.5 or higher, which covers some GPUs in the Kepler generation and all GPUs in the Maxwell, Pascal (including your GTX 1080 Ti), Volta, and Turing generations.
If you have a GPU of less than cc 3.5, possibly the best suggestion for similar benefits (larger than __constant__, not needing uniform access) would be to use texture memory. This is also documented elsewhere in the programming guide, there are various CUDA sample codes that demonstrate the use of texture memory, and plenty of questions here on the SO cuda tag covering it as well.
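For reference, a rough sketch of the texture path using the texture object API (this particular API needs cc 3.0+; older devices use texture references instead); d_table and nbytes are placeholder names:

    // Bind an existing linear device buffer (d_table, nbytes bytes of float data)
    // to a texture object so kernel reads go through the texture cache.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_table;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = nbytes;

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    // Inside a kernel, fetch elements with:
    //   float v = tex1Dfetch<float>(tex, i);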
Constant memory that doesn't fit in the hardware's constant buffer will typically "spill" into global memory in OpenCL. Bank conflicts are usually an issue with local memory, however, so that's probably not it. I'm assuming CUDA's 64 KiB constant limit reflects NVIDIA's hardware, so OpenCL isn't going to magically perform better here.
Reading global memory without a predictable pattern can of course be slow, however, especially if you don't have sufficient thread occupancy and arithmetic to mask the memory latency.
Without knowing anything further about your problem space, this also brings me to the directions you could take for further optimisation, assuming your global memory reads are the issue:
Reduce the amount of constant/global data you need, for example by using more efficient types, other compression mechanisms, or computing some of the values on the fly (possibly storing them in local memory for all threads in a group to share).
Cluster the most frequently used data in a small constant buffer, and explicitly place the more rarely used constants in a global buffer. This may help the runtime lay it out more efficiently in the hardware. If that doesn't help, try to copy the frequently used data into local memory (a sketch of this appears after this list), and make your thread groups large and comparatively long-running to hide the copying hit.
Check if thread occupancy could be improved. It usually can, and this tends to give you substantial performance improvements in almost any situation (except if your code is already extremely ALU/FPU bound).
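In CUDA terms (shared memory being the analogue of OpenCL local memory), a sketch of the "copy the frequently used data into on-chip memory" idea might look like the following; the names and the 256-element hot-data size are assumptions:

    __global__ void mapWithStagedConstants(const float* __restrict__ hotData,  // small, frequently used subset
                                           const float* in, float* out, int n)
    {
        // Stage the "hot" constants into on-chip shared memory once per block.
        __shared__ float s_hot[256];
        for (int j = threadIdx.x; j < 256; j += blockDim.x)
            s_hot[j] = hotData[j];
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * s_hot[i % 256];   // placeholder use of the staged data
    }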
Related
In an NVIDIA developer blog post, An Even Easier Introduction to CUDA, the writer explains:
To compute on the GPU, I need to allocate memory accessible by the GPU. Unified Memory in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call cudaMallocManaged(), which returns a pointer that you can access from host (CPU) code or device (GPU) code.
I found this both interesting (since it seems potentially convenient) and confusing:
returns a pointer that you can access from host (CPU) code or device (GPU) code.
For this to be true, it seems like cudaMallocManaged() must be syncing 2 buffers across VRAM and RAM. Is this the case? Or is my understanding lacking?
In my work so far with GPU acceleration on top of the WebGL abstraction layer via GPU.js, I learned the distinct performance difference between passing VRAM-based buffers (textures in WebGL) from kernel to kernel (keeping the buffer on the GPU, highly performant) and retrieving the buffer value outside of the kernels to access it in RAM through JavaScript (pulling the buffer off the GPU, which takes a performance hit, since buffers in VRAM on the GPU don't magically move to RAM).
Forgive my highly abstracted understanding / description of the topic, since I know most CUDA / C++ devs have a much more granular understanding of the process.
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
Yes, more or less. The "synchronization" is referred to in the managed memory model as migration of data. Virtual address carveouts are made for all visible processors, and the data is migrated to (i.e. moved to, and given a physical allocation on) the processor that attempts to access it.
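A minimal sketch of the flow being described (error checking omitted; the addOne kernel is a placeholder):

    #include <cstdio>

    __global__ void addOne(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float* x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float));  // one pointer, visible to CPU and GPU

        for (int i = 0; i < n; i++) x[i] = 0.0f;   // CPU touch: pages get a physical allocation on the host

        addOne<<<(n + 255) / 256, 256>>>(x, n);    // GPU touch: data migrates to device memory
        cudaDeviceSynchronize();                   // required before the CPU reads the results

        printf("x[0] = %f\n", x[0]);               // CPU touch again: pages migrate back as needed
        cudaFree(x);
        return 0;
    }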
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
If you never need to touch the buffer on the CPU, then what will happen is that the VA carveout will be made in the CPU VA space, but no physical allocation will be made for it. When the GPU attempts to actually access the data, it will cause the allocation to "appear" and use up GPU memory. Although there are "costs" to be sure, there is no usage of CPU (physical) memory in this case. Furthermore, once instantiated in GPU memory, there should be no ongoing additional cost for the GPU to access it; it should run at "full" speed. The instantiation/migration process is a complex one, and what I am describing here is what I would consider the "principal" modality or behavior. There are many factors that could affect this.
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
No, this is managed by the runtime, not compile time.
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
No, you don't have it all wrong. Yes, we are talking about VRAM.
The blog you reference barely touches on managed memory, which is a fairly involved subject. There are numerous online resources to learn more about it, and you might want to review some of them; here is one. There are good GTC presentations on managed memory, including here. There is also an entire section of the CUDA programming guide covering managed memory.
I would like to have a class containing a varying number n of objects which are easily iterated over as a group, with each object having a large list (20+) of individually modified variables that influence the class methods. Before I started learning OOP, I would just make a 2D array and load the variable values into each row, corresponding to each object, and then append/delete rows as needed. Is this still a good solution? Is there a better solution?
Again, in this case I am more interested in pushing processor performance than in preserving abstraction, modularity, etc. In this respect, I am very confused about the way the data container is ultimately read into the L1 cache, and how to ensure that I do not induce paging inefficiency or cache misses. If, for example, I have a 128 KB cache, I assume the entire container should fit into this cache to be efficient, correct?
According to Agner Fog's optimization manual, the C++ Standard Template Library is rather inefficient, because it makes extensive use of dynamic memory allocation. However, a fixed size array that is made larger than necessary (e.g. because the needed size is not known at compile time) can also be bad for performance, because a larger size means that it won't fit into the cache as easily. In such situations, the STL's dynamic memory allocation could perform better.
Generally, it is best to store your data in contiguous memory. You can use a fixed size array or an std::vector for this. However, before using std::vector, you should call std::vector::reserve() for performance reasons, so that the memory does not have to be reallocated too often. If you reallocate too often, the heap could become fragmented, which is also bad for cache performance.
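A small illustrative sketch of the reserve() point (the Particle type here is just an example):

    #include <vector>

    struct Particle {                 // example object with a handful of per-object variables
        float x, y, z;
        float vx, vy, vz;
    };

    std::vector<Particle> particles;
    particles.reserve(10000);         // one contiguous allocation up front
    for (int i = 0; i < 10000; ++i)
        particles.push_back(Particle{});   // no reallocation while size() stays within the reserved capacity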
Ideally, the data that you are working on will fit entirely into the Level 1 data cache (which is about 32 KB on modern desktop processors). However, even if it doesn't fit, the Level 2 cache is much larger (about 512 KB) and the Level 3 Cache is several Megabytes. The higher-level caches are still significantly faster than reading from main memory.
It is best if your memory access patterns are predictable, so that the hardware prefetcher can do its work best. Sequential memory accesses are easiest for the hardware prefetcher to predict.
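As an illustration, traversing a 2D array along its contiguous dimension keeps accesses sequential, whereas traversing it the other way strides through memory and is much harder on the caches and the prefetcher:

    static float a[1024][1024];   // static to keep this sketch off the stack

    // Prefetcher friendly: the inner loop walks contiguous memory.
    float sumRowMajor = 0.0f;
    for (int i = 0; i < 1024; ++i)
        for (int j = 0; j < 1024; ++j)
            sumRowMajor += a[i][j];

    // Same result, but each inner-loop step jumps 1024 * sizeof(float) bytes,
    // causing far more cache misses once the array exceeds the cache.
    float sumColMajor = 0.0f;
    for (int j = 0; j < 1024; ++j)
        for (int i = 0; i < 1024; ++i)
            sumColMajor += a[i][j];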
The CPU cache works best if you access the same data several times and if the data is small enough to be kept in the cache. However, even if the data is used only once, the CPU cache can still make the memory access faster, by making use of prefetching.
A cache miss will occur if
the data is being accessed for the first time and the hardware prefetcher was not able to predict and prefetch the needed memory address in time, or
the data is no longer cached, because the cache had to make room for other data, due to the data being too large to fit in the cache.
In addition to the hardware prefetcher attempting to predict needed memory addresses in advance (which is automatic), it is also possible for the programmer to explicitly issue a software prefetch. However, from what I have read, it is hard to get significant performance gains from doing this, except under very special circumstances.
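For completeness, a software prefetch with the GCC/Clang builtin looks roughly like this; process() and the prefetch distance of 16 are placeholders, not recommendations:

    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* high temporal locality */);
        process(a[i]);   // hypothetical per-element work
    }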
At roughly what code complexity do OpenACC kernels lose efficiency on a common GPU, and at what point do registers, shared memory, or some other resource start to bottleneck performance?
Also, is there some point where having too few tasks means that the overhead of transferring work to the GPU and its cores becomes a bottleneck?
Would cache sizes, and whether the code fits into them, indicate the optimal amount of work per kernel, or is it something else?
About how big is the OpenACC overhead per kernel compared to the potential performance, and does it vary a lot with different directives?
I would refrain from using the complexity of the code as an indication of performance. You can have a highly complex code run very efficiently on a GPU and a simple code run poorly. Instead, I would look at the following factors:
Data movement between the device and host. Limit the frequency of data movement and try to transfer data in contiguous chunks. Use OpenACC unstructured data regions to match the host allocation on the device (i.e. use "enter data" at the same time as you allocate data via "new" or "malloc"; see the sketch after this list). Move as much compute to the GPU as you can and only use the OpenACC update directive to synchronize host and device data when absolutely necessary. In cases where data movement is unavoidable, investigate using the "async" clause to interleave the data movement with compute.
Data access on the device and limiting memory divergence. Be sure to lay out your data so that the stride-1 (contiguous) dimension of your arrays is accessed contiguously across the vectors.
Have a high compute intensity, which is the ratio of computation to data movement. The more compute and the less data movement, the better. However, lower compute intensity loops are fine if there are other high intensity loops and the cost of moving the data to the host would offset the cost of running the kernel on the device.
Avoid allocating data on the device, since it forces threads to serialize. This includes using Fortran "automatic" arrays and declaring C++ objects whose constructors perform allocation.
Avoid atomic operations. Atomic operations are actually quite efficient when compared to host atomics, but still should be avoided if possible.
Avoid subroutine calls. Try to inline routines when possible.
Occupancy. Occupancy is the ratio of the number of threads that can potentially be running on the GPU over the maximum number of threads that could be running. Note that 100% occupancy does not guarantee high performance, but you should try to get above 50% if possible. The limiters to occupancy are the number of registers used per thread (vector) and shared memory used per block (gang). Assuming you're using the PGI compiler, you can see the limits of your device by running the PGI "pgaccelinfo" utility. The number of registers used will depend upon the number of local scalars used (explicitly declared by the programmer and temporaries created by the compiler to hold intermediate calculations), and the amount of shared memory used will be determined by the OpenACC "cache" and "private" directives when "private" is used on a "gang" loop. You can see how much each kernel uses by adding the flag "-ta=tesla:ptxinfo". You can limit the number of registers used per thread via "-ta=tesla:maxregcount:". Reducing the number of registers will increase the occupancy but also increase the number of register spills. Spills are fine so long as they only spill to L1/L2 cache. Spilling to global memory will hurt performance. It's often better to suffer lower occupancy than to spill to global memory.
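A sketch of the unstructured data region pattern mentioned in the first point above (the loop body and names are placeholders):

    float* a = (float*)malloc(n * sizeof(float));
    #pragma acc enter data create(a[0:n])            // device allocation paired with the host malloc

    for (int step = 0; step < nsteps; ++step) {
        #pragma acc parallel loop present(a[0:n])    // data stays resident on the device between steps
        for (int i = 0; i < n; ++i)
            a[i] = a[i] * 0.5f + step;               // placeholder computation
    }

    #pragma acc update self(a[0:n])                  // copy back to the host only when actually needed
    #pragma acc exit data delete(a[0:n])             // device deallocation paired with free()
    free(a);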
Note that I highly recommend using a profiler (PGPROF, NVprof, Score-P, TAU, Vampir, etc.) to help discover a program's performance bottlenecks.
On a compute capability 2.x device, how would I make sure that the GPU uses coalesced memory access when using mapped pinned memory, assuming that the 2D data would normally require padding when stored in global memory?
I can't seem to find information about this anywhere, perhaps I should be looking better or perhaps I am missing something. Any pointers in the right direction are welcome...
The coalescing approach should be applied when using zero-copy memory. Quoting the CUDA C Best Practices Guide:
Because the data is not cached on the GPU, mapped pinned memory should be read or written only once, and the global loads and stores that read and write the memory should be coalesced.
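A rough sketch of the setup, with the device-side access pattern coalesced exactly as it would be for global memory (names are placeholders):

    float *h_data = NULL, *d_data = NULL;

    cudaSetDeviceFlags(cudaDeviceMapHost);                                    // must be set before the context is created
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);   // mapped pinned (zero-copy) host buffer
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);                     // device-side alias of the same buffer

    // In the kernel, have consecutive threads touch consecutive elements,
    // and read or write each element only once:
    //   int i = blockIdx.x * blockDim.x + threadIdx.x;
    //   if (i < n) out[i] = d_data[i];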
Quoting the "CUDA Programming" book, by S. Cook
If you think about what happens with access to global memory, an entire cache line is brought in from memory on compute 2.x hardware. Even on compute 1.x hardware the same 128 bytes, potentially reduced to 64 or 32, is fetched from global memory.
NVIDIA does not publish the size of the PCI-E transfers it uses, or details of how zero copy is actually implemented. However, the coalescing approach used for global memory could be used with PCI-E transfers. The warp memory latency hiding model can equally be applied to PCI-E transfers, provided there is enough arithmetic density to hide the latency of the PCI-E transfers.
I always read that it is slow to allocate and transfer data from the CPU to the GPU. Is this because cudaMalloc is slow? Is it because cudaMemcpy is slow? Or is it because both of them are slow?
It is mostly tied to two things, the first being the speed of the PCI Express bus between the card and the CPU. The other is tied to the way these functions operate. Now, I think the new CUDA 4 has better support for memory allocation (standard or pinned) and a way to access memory transparently across the bus.
Now, let's face it, at some point you'll need to get data from point A to point B to compute something. The best way to handle this is to either have a really large computation going on or to use CUDA streams to overlap transfer and computation on the GPU.
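As a sketch of the stream idea, assuming h_in/h_out are pinned host buffers, d_in/d_out are device buffers, n divides evenly by the stream count, and myKernel is a placeholder:

    const int nStreams = 4;
    const int chunk = n / nStreams;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        // Copy chunk s in while earlier chunks are already being processed.
        cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        myKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_in + offset, d_out + offset, chunk);
        cudaMemcpyAsync(h_out + offset, d_out + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);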
In most applications, you should be doing cudaMalloc once at the beginning and then not call it any more. Thus, the bottleneck is really cudaMemcpy.
This is due to physical limitations. For a standard PCI-E 2.0 x16 link, you'll get 8 GB/s theoretical but typically 5-6 GB/s in practice. Compare this with even a mid-range Fermi like the GTX 460, which has 80+ GB/s of bandwidth on the device. You're in effect taking an order-of-magnitude hit in memory bandwidth, spiking your data transfer times accordingly.
GPGPUs are supposed to be supercomputers and I believe Seymour Cray (the supercomputer guy) said, "a supercomputer turns compute-bound problems into I/O bound problems". Thus, optimizing data transfers is everything.
In my personal experience, iterative algorithms are the ones that show by far the best improvements when ported to a GPGPU (2-3 orders of magnitude), because you can eliminate transfer time by keeping everything in situ on the GPU.