CUDA pinned memory and coalescing - c++

On a compute capability 2.x device how would I make sure that the gpu uses coalesced memory access when using mapped pinned memory and assuming that normally when using global memory the 2D data would require padding?
I can't seem to find information about this anywhere, perhaps I should be looking better or perhaps I am missing something. Any pointers in the right direction are welcome...

The coalescing approach should be applied when using zero copy memory. Quoting the CUDA C BEST PRACTICES GUIDE:
Because the data is not cached on the GPU, mapped
pinned memory should be read or written only once, and the global loads and stores
that read and write the memory should be coalesced.
Quoting the "CUDA Programming" book, by S. Cook
If you think about what happens with access to global memory, an entire cache line is brought in from memory on compute 2.x hardware. Even on compute 1.x hardware the same 128 bytes, potentially reduced to 64 or 32, is fetched from global memory.
NVIDIA does not publish the size of the PCI-E transfers it uses, or details on how zero copy is actually implemented. However, the coalescing approach used for global memory could be used with PCI-E transfer. The warp memory latency hiding model can equally be applied to PCI-E transfers, providing there is enough arithmetic density to hide the latency of the PCI-E transfers.

Related

Does cudaMallocManaged() create a synchronized buffer in RAM and VRAM?

In an Nvidia developer blog: An Even Easier Introduction to CUDA the writer explains:
To compute on the GPU, I need to allocate memory accessible by the
GPU. Unified Memory in CUDA makes this easy by providing a single
memory space accessible by all GPUs and CPUs in your system. To
allocate data in unified memory, call cudaMallocManaged(), which
returns a pointer that you can access from host (CPU) code or device
(GPU) code.
I found this both interesting (since it seems potentially convenient) and confusing:
returns a pointer that you can access from host (CPU) code or device
(GPU) code.
For this to be true, it seems like cudaMallocManaged() must be syncing 2 buffers across VRAM and RAM. Is this the case? Or is my understanding lacking?
In my work so far with GPU acceleration on top of the WebGL abstraction layer via GPU.js, I learned the distinct performance difference between passing VRAM based buffers (textures in WebGL) from kernel to kernel (keeping the buffer on the GPU, highly performant) and retrieving the buffer value outside of the kernels to access it in RAM through JavaScript (pulling the buffer off the GPU, taking a performance hit since buffers in VRAM on the GPU don't magically move to RAM).
Forgive my highly abstracted understanding / description of the topic, since I know most CUDA / C++ devs have a much more granular understanding of the process.
So is cudaMallocManaged() creating synchronized buffers in both RAM
and VRAM for convenience of the developer?
If so, wouldn't doing so come with an unnecessary cost in cases where
we might never need to touch that buffer with the CPU?
Does the compiler perhaps just check if we ever reference that buffer
from CPU and never create the CPU side of the synced buffer if it's
not needed?
Or do I have it all wrong? Are we not even talking VRAM? How does
this work?
So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
Yes, more or less. The "synchronization" is referred to in the managed memory model as migration of data. Virtual address carveouts are made for all visible processors, and the data is migrated (i.e. moved to, and provided a physical allocation for) the processor that attempts to access it.
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
If you never need to touch the buffer on the CPU, then what will happen is that the VA carveout will be made in the CPU VA space, but no physical allocation will be made for it. When the GPU attempts to actually access the data, it will cause the allocation to "appear" and use up GPU memory. Although there are "costs" to be sure, there is no usage of CPU (physical) memory in this case. Furthermore, once instantiated in GPU memory, there should be no ongoing additional cost for the GPU to access it; it should run at "full" speed. The instantiation/migration process is a complex one, and what I am describing here is what I would consider the "principal" modality or behavior. There are many factors that could affect this.
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
No, this is managed by the runtime, not compile time.
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
No you don't have it all wrong. Yes we are talking about VRAM.
The blog you reference barely touches on managed memory, which is a fairly involved subject. There are numerous online resources to learn more about it. You might want to review some of them. here is one. There are good GTC presentations on managed memory, including here. There is also an entire section of the CUDA programming guide covering managed memory.

Share large constant data among cuda threads

I have a kernel which is called multiple times. In each call a constant data of around 240 kbytes will be shared and processed by threads. Threads work independently like a map function. The stalling time of the threads is considerable. The reason behind that can be the bank conflict of memory reads. How can I handle this?(I have GTX 1080 ti)
Can "const global" of opencl handle this? (because constant memory in cuda is limited to 64 kb)
In CUDA, I believe the best recommendation would be to make use of the so called "Read-Only" cache. This has at least two possible benefits over the __constant__ memory/constant cache system:
It is not limited to 64kB like __constant__ memory is.
It does not expect or require "uniform access" like the constant cache does, to deliver full access bandwidth/best performance. Uniform access refers to the idea that all threads in a warp are accessing the same location or same constant memory value (per read cycle/instruction).
The read-only cache is documented in the CUDA programming guide. Possibly the easiest way to use it is to decorate your pointers passed to the CUDA kernel with __restrict__ (assuming you are not aliasing between pointers) and to decorate the pointer that refers to the large constant data with const ... __restrict__. This will allow the compiler to generate appropriate LDG instructions for access to constant data, pulling it through the read-only cache mechanism.
This read-only cache mechanism is only supported on cc 3.5 or higher GPUs, but that covers some GPUs in the Kepler generation and all GPUs in the Maxwell, Pascal (including your GTX 1080 ti), Volta, and Turing generations.
If you have a GPU that is less than cc3.5, possibly the best suggestion for similar benefits (larger than __const__, not needing uniform access) then would be to use texture memory. This is also documented elsewhere in the programming guide, there are various CUDA sample codes that demonstrate the use of texture memory, and plenty of questions here on the SO cuda tag covering it as well.
Constant memory that doesn't fit in the hardware's constant buffer will typically "spill" into global memory on OpenCL. Bank conflicts are usually an issue with local memory, however, so that's probably not it. I'm assuming CUDA's 64kiB constant limit reflects nvidia's hardware, so OpenCL isn't going to magically perform better here.
Reading global memory without a predictable pattern can of course be slow, however, especially if you don't have sufficient thread occupancy and arithmetic to mask the memory latency.
Without knowing anything further about your problem space, this also brings me to the directions you could take further optimisations, assuming your global memory reads are the issue:
Reduce the amount of constant/global data you need, for example by using more efficient types, other compression mechanisms, or computing some of the values on the fly (possibly storing them in local memory for all threads in a group to share).
Cluster the most frequently used data in a small constant buffer, and explicitly place the more rarely used constants in a global buffer. This may help the runtime lay it out more efficiently in the hardware. If that doesn't help, try to copy the frequently used data into local memory, and make your thread groups large and comparatively long-running to hide the copying hit.
Check if thread occupancy could be improved. It usually can, and this tends to give you substantial performance improvements in almost any situation. (except if your code is already extremely ALU/FPU bound)

When should I prefer write-combined CUDA-allocated mapped host memory?

the cudaHostAlloc() API call has, among others, the flags:
cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
I could quite understand when exactly I would prefer the "write-combined" option. I mean, it didn't say the transfer may be faster just in one direction, so why do they only recommend it for one direction? Also, which kind of systems benefit from this "write-combining"?
I read this white paper, Section 4.7, and still could not get it. Ok, so reading by the CPU is inefficient; but what if other benefits offset this inefficiency? Or - if they cannot, why can't they?
An elucidation would be appreciated.
Write-combined memory allows the CPU to combine multiple narrow writes into fewer wider writes, thus increasing the efficiency of memory writes. If memory serves, WC memory was first introduced with the Intel PentiumPro around 1995 to speed up CPU writes into the frame buffer of video cards. I am not up to speed on which modern system platforms use or support this.
The efficiency of reads performed by the CPU is going to be the same for both cudaHostAllocMapped and cudaHostAllocWriteCombined. But because the latter allows more efficient writes by the CPU, it is recommended for "buffers that will be written by the CPU and read by the device", as stated by quoted documentation.

Where are mapped device memory to, in virtual addressing, when using Intel I/OAT?

When I use Intel I/OAT for DMA zero-copy/zero-cycles(without CPU) transfer through async_memcpy, then where are mapped device memory to, in virtual addressing: to the kernel-buffer(kernel space) or to the user-buffer(user space)?
And does it make any sense to use I/OAT in modern x86_64 CPUs (when CPU-core can fast access to the RAM without north-bridge of chipset)?
http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html
Given that the memory is physical memory, it can be any memory that the kernel can address, including both kernel buffers and user-space buffers. It does however have to be "pinned" or "locked", so that the memory doesn't get taken away (e.g. someone doing free on the memory should not release the memory back to the OS for reassignment to another process, because you could get very interesting effects if that is the case). This is of course the same rules that apply to various other DMA accesses.
I doubt very much this helps in copying data structures for your average user-mode application. On the other hand, I don't believe Intel would put these sort of features into the processor unless they thought it was beneficial in some way. The way I understand it is that it's helpful for copying the network receive buffer into user-mode application that is receiving the data, with less CPU involvement. It doesn't necessarily speed up the actual memory transfer much (if at all), but it offloads the CPU from the to do other things.
I'm pretty sure I saw something not so long ago about this technology [or something very similar] also going into the latest models of processors, so I expect there is some advantage to it.

CUDA Zero Copy memory considerations

I am trying to figure out if using cudaHostAlloc (or cudaMallocHost?) is appropriate.
I am trying to run a kernel where my input data is more than the amount available on the GPU.
Can I cudaMallocHost more space than there is on the GPU? If not, and lets say I allocate 1/4 the space that I need (which will fit on the GPU), is there any advantage to using pinned memory?
I would essentially have to still copy from that 1/4 sized buffer into my full size malloc'd buffer and that's probably no faster than just using normal cudaMalloc right?
Is this typical usage scenario correct for using cudaMallocHost:
allocate pinned host memory (lets call it "h_p")
populate h_p with input data-
get device pointer on GPU for h_p
run kernel using that device pointer to modify contents of array-
use h_p like normal, which now has modified contents-
So - no copy has to happy between step 4 and 5 right?
if that is correct, then I can see the advantage for kernels that will fit on the GPU all at once at least
Memory transfer is an important factor when it comes to the performance of CUDA applications. cudaMallocHost can do two things:
allocate pinned memory: this is page-locked host memory that the CUDA runtime can track. If host memory allocated this way is involved in cudaMemcpy as either source or destination, the CUDA runtime will be able to perform an optimized memory transfer.
allocate mapped memory: this is also page-locked memory that can be used in kernel code directly as it is mapped to CUDA address space. To do this you have to set the cudaDeviceMapHost flag using cudaSetDeviceFlags before using any other CUDA function. The GPU memory size does not limit the size of mapped host memory.
I'm not sure about the performance of the latter technique. It could allow you to overlap computation and communication very nicely.
If you access the memory in blocks inside your kernel (i.e. you don't need the entire data but only a section) you could use a multi-buffering method utilizing asynchronous memory transfers with cudaMemcpyAsync by having multiple-buffers on the GPU: compute on one buffer, transfer one buffer to host and transfer one buffer to device at the same time.
I believe your assertions about the usage scenario are correct when using cudaDeviceMapHost type of allocation. You do not have to do an explicit copy but there certainly will be an implicit copy that you don't see. There's a chance it overlaps nicely with your computation. Note that you might need to synchronize the kernel call to make sure the kernel finished and that you have the modified content in h_p.
Using host memory would be orders of magnitude slower than on-device memory. It has both very high latency and very limited throughput. For example capacity of PCIe x16 is mere 8GB/s when bandwidth of device memory on GTX460 is 108GB/s
Neither the CUDA C Programming Guide, nor the CUDA Best Practices Guide mention that the amount allocated by cudaMallocHost can 't be bigger than the device memory so I conclude it's possible.
Data transfers from page locked memory to the device are faster than normal data transfers and even faster if using write-combined memory. Also, the memory allocated this way can be mapped into device memory space eliminating the need to (manually) copy the data at all. It happens automatic as the data is needed so you should be able to process more data than fits into device memory.
However, system performance (of the host) can greatly suffer, if the page-locked amount makes up a significant part of the host memory.
So when to use this technique?, simple: If the data needs be read only once and written only once, use it. It will yield a performance gain, since one would've to copy data back and forth at some point anyway. But as soon as the need to store intermediate results, that don't fit into registers or shared memory, arises, process chunks of your data that fit into device memory with cudaMalloc.
Yes, you can cudaMallocHost more space than there is on the gpu.
Pinned memory can have higher bandwidth, but can decrease host performance. It is very easy to switch between normal host memory, pinned memory, write-combined memory, and even mapped (zero-copy) memory. Why don't you use normal host memory first and compare the performance?
Yes, your usage scenario should work.
Keep in mind that global device memory access is slow, and zero-copy host memory access is even slower. Whether zero-copy is right for you depends entirely on how you use the memory.
Also consider use of streams for overlapping data transfer/ kernel execution.
This provides gpu work on chunks of data