When should I prefer write-combined CUDA-allocated mapped host memory?

When should I prefer write-combined CUDA-allocated mapped host memory? - c++

the cudaHostAlloc() API call has, among others, the flags:
cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
I could quite understand when exactly I would prefer the "write-combined" option. I mean, it didn't say the transfer may be faster just in one direction, so why do they only recommend it for one direction? Also, which kind of systems benefit from this "write-combining"?
I read this white paper, Section 4.7, and still could not get it. Ok, so reading by the CPU is inefficient; but what if other benefits offset this inefficiency? Or - if they cannot, why can't they?
An elucidation would be appreciated.

Write-combined memory allows the CPU to combine multiple narrow writes into fewer wider writes, thus increasing the efficiency of memory writes. If memory serves, WC memory was first introduced with the Intel PentiumPro around 1995 to speed up CPU writes into the frame buffer of video cards. I am not up to speed on which modern system platforms use or support this.
The efficiency of reads performed by the CPU is going to be the same for both cudaHostAllocMapped and cudaHostAllocWriteCombined. But because the latter allows more efficient writes by the CPU, it is recommended for "buffers that will be written by the CPU and read by the device", as stated by quoted documentation.

Related

Does cudaMallocManaged() create a synchronized buffer in RAM and VRAM?

In an Nvidia developer blog: An Even Easier Introduction to CUDA the writer explains:
To compute on the GPU, I need to allocate memory accessible by the
GPU. Unified Memory in CUDA makes this easy by providing a single
memory space accessible by all GPUs and CPUs in your system. To
allocate data in unified memory, call cudaMallocManaged(), which
returns a pointer that you can access from host (CPU) code or device
(GPU) code.
I found this both interesting (since it seems potentially convenient) and confusing:
returns a pointer that you can access from host (CPU) code or device
(GPU) code.
For this to be true, it seems like cudaMallocManaged() must be syncing 2 buffers across VRAM and RAM. Is this the case? Or is my understanding lacking?
In my work so far with GPU acceleration on top of the WebGL abstraction layer via GPU.js, I learned the distinct performance difference between passing VRAM based buffers (textures in WebGL) from kernel to kernel (keeping the buffer on the GPU, highly performant) and retrieving the buffer value outside of the kernels to access it in RAM through JavaScript (pulling the buffer off the GPU, taking a performance hit since buffers in VRAM on the GPU don't magically move to RAM).
Forgive my highly abstracted understanding / description of the topic, since I know most CUDA / C++ devs have a much more granular understanding of the process.
So is cudaMallocManaged() creating synchronized buffers in both RAM
and VRAM for convenience of the developer?
If so, wouldn't doing so come with an unnecessary cost in cases where
we might never need to touch that buffer with the CPU?
Does the compiler perhaps just check if we ever reference that buffer
from CPU and never create the CPU side of the synced buffer if it's
not needed?
Or do I have it all wrong? Are we not even talking VRAM? How does
this work?

So is cudaMallocManaged() creating synchronized buffers in both RAM and VRAM for convenience of the developer?
Yes, more or less. The "synchronization" is referred to in the managed memory model as migration of data. Virtual address carveouts are made for all visible processors, and the data is migrated (i.e. moved to, and provided a physical allocation for) the processor that attempts to access it.
If so, wouldn't doing so come with an unnecessary cost in cases where we might never need to touch that buffer with the CPU?
If you never need to touch the buffer on the CPU, then what will happen is that the VA carveout will be made in the CPU VA space, but no physical allocation will be made for it. When the GPU attempts to actually access the data, it will cause the allocation to "appear" and use up GPU memory. Although there are "costs" to be sure, there is no usage of CPU (physical) memory in this case. Furthermore, once instantiated in GPU memory, there should be no ongoing additional cost for the GPU to access it; it should run at "full" speed. The instantiation/migration process is a complex one, and what I am describing here is what I would consider the "principal" modality or behavior. There are many factors that could affect this.
Does the compiler perhaps just check if we ever reference that buffer from CPU and never create the CPU side of the synced buffer if it's not needed?
No, this is managed by the runtime, not compile time.
Or do I have it all wrong? Are we not even talking VRAM? How does this work?
No you don't have it all wrong. Yes we are talking about VRAM.
The blog you reference barely touches on managed memory, which is a fairly involved subject. There are numerous online resources to learn more about it. You might want to review some of them. here is one. There are good GTC presentations on managed memory, including here. There is also an entire section of the CUDA programming guide covering managed memory.

CUDA pinned memory and coalescing

On a compute capability 2.x device how would I make sure that the gpu uses coalesced memory access when using mapped pinned memory and assuming that normally when using global memory the 2D data would require padding?
I can't seem to find information about this anywhere, perhaps I should be looking better or perhaps I am missing something. Any pointers in the right direction are welcome...

The coalescing approach should be applied when using zero copy memory. Quoting the CUDA C BEST PRACTICES GUIDE:
Because the data is not cached on the GPU, mapped
pinned memory should be read or written only once, and the global loads and stores
that read and write the memory should be coalesced.
Quoting the "CUDA Programming" book, by S. Cook
If you think about what happens with access to global memory, an entire cache line is brought in from memory on compute 2.x hardware. Even on compute 1.x hardware the same 128 bytes, potentially reduced to 64 or 32, is fetched from global memory.
NVIDIA does not publish the size of the PCI-E transfers it uses, or details on how zero copy is actually implemented. However, the coalescing approach used for global memory could be used with PCI-E transfer. The warp memory latency hiding model can equally be applied to PCI-E transfers, providing there is enough arithmetic density to hide the latency of the PCI-E transfers.

Where are mapped device memory to, in virtual addressing, when using Intel I/OAT?

When I use Intel I/OAT for DMA zero-copy/zero-cycles(without CPU) transfer through async_memcpy, then where are mapped device memory to, in virtual addressing: to the kernel-buffer(kernel space) or to the user-buffer(user space)?
And does it make any sense to use I/OAT in modern x86_64 CPUs (when CPU-core can fast access to the RAM without north-bridge of chipset)?
http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html

Given that the memory is physical memory, it can be any memory that the kernel can address, including both kernel buffers and user-space buffers. It does however have to be "pinned" or "locked", so that the memory doesn't get taken away (e.g. someone doing free on the memory should not release the memory back to the OS for reassignment to another process, because you could get very interesting effects if that is the case). This is of course the same rules that apply to various other DMA accesses.
I doubt very much this helps in copying data structures for your average user-mode application. On the other hand, I don't believe Intel would put these sort of features into the processor unless they thought it was beneficial in some way. The way I understand it is that it's helpful for copying the network receive buffer into user-mode application that is receiving the data, with less CPU involvement. It doesn't necessarily speed up the actual memory transfer much (if at all), but it offloads the CPU from the to do other things.
I'm pretty sure I saw something not so long ago about this technology [or something very similar] also going into the latest models of processors, so I expect there is some advantage to it.

CUDA Zero Copy memory considerations

I am trying to figure out if using cudaHostAlloc (or cudaMallocHost?) is appropriate.
I am trying to run a kernel where my input data is more than the amount available on the GPU.
Can I cudaMallocHost more space than there is on the GPU? If not, and lets say I allocate 1/4 the space that I need (which will fit on the GPU), is there any advantage to using pinned memory?
I would essentially have to still copy from that 1/4 sized buffer into my full size malloc'd buffer and that's probably no faster than just using normal cudaMalloc right?
Is this typical usage scenario correct for using cudaMallocHost:
allocate pinned host memory (lets call it "h_p")
populate h_p with input data-
get device pointer on GPU for h_p
run kernel using that device pointer to modify contents of array-
use h_p like normal, which now has modified contents-
So - no copy has to happy between step 4 and 5 right?
if that is correct, then I can see the advantage for kernels that will fit on the GPU all at once at least

Memory transfer is an important factor when it comes to the performance of CUDA applications. cudaMallocHost can do two things:
allocate pinned memory: this is page-locked host memory that the CUDA runtime can track. If host memory allocated this way is involved in cudaMemcpy as either source or destination, the CUDA runtime will be able to perform an optimized memory transfer.
allocate mapped memory: this is also page-locked memory that can be used in kernel code directly as it is mapped to CUDA address space. To do this you have to set the cudaDeviceMapHost flag using cudaSetDeviceFlags before using any other CUDA function. The GPU memory size does not limit the size of mapped host memory.
I'm not sure about the performance of the latter technique. It could allow you to overlap computation and communication very nicely.
If you access the memory in blocks inside your kernel (i.e. you don't need the entire data but only a section) you could use a multi-buffering method utilizing asynchronous memory transfers with cudaMemcpyAsync by having multiple-buffers on the GPU: compute on one buffer, transfer one buffer to host and transfer one buffer to device at the same time.
I believe your assertions about the usage scenario are correct when using cudaDeviceMapHost type of allocation. You do not have to do an explicit copy but there certainly will be an implicit copy that you don't see. There's a chance it overlaps nicely with your computation. Note that you might need to synchronize the kernel call to make sure the kernel finished and that you have the modified content in h_p.

Using host memory would be orders of magnitude slower than on-device memory. It has both very high latency and very limited throughput. For example capacity of PCIe x16 is mere 8GB/s when bandwidth of device memory on GTX460 is 108GB/s

Neither the CUDA C Programming Guide, nor the CUDA Best Practices Guide mention that the amount allocated by cudaMallocHost can 't be bigger than the device memory so I conclude it's possible.
Data transfers from page locked memory to the device are faster than normal data transfers and even faster if using write-combined memory. Also, the memory allocated this way can be mapped into device memory space eliminating the need to (manually) copy the data at all. It happens automatic as the data is needed so you should be able to process more data than fits into device memory.
However, system performance (of the host) can greatly suffer, if the page-locked amount makes up a significant part of the host memory.
So when to use this technique?, simple: If the data needs be read only once and written only once, use it. It will yield a performance gain, since one would've to copy data back and forth at some point anyway. But as soon as the need to store intermediate results, that don't fit into registers or shared memory, arises, process chunks of your data that fit into device memory with cudaMalloc.

Yes, you can cudaMallocHost more space than there is on the gpu.
Pinned memory can have higher bandwidth, but can decrease host performance. It is very easy to switch between normal host memory, pinned memory, write-combined memory, and even mapped (zero-copy) memory. Why don't you use normal host memory first and compare the performance?
Yes, your usage scenario should work.
Keep in mind that global device memory access is slow, and zero-copy host memory access is even slower. Whether zero-copy is right for you depends entirely on how you use the memory.

Also consider use of streams for overlapping data transfer/ kernel execution.
This provides gpu work on chunks of data

Staying away from virtual memory in Windows\C++

I'm writing a performance critical application where its essential to store as much data as possible in the physical memory before dumping to disc.
I can use ::GlobalMemoryStatusEx(...) and ::GetProcessMemoryInfo(...) to find out what percentage of physical memory is reserved\free and how much memory my current process handles.
Using this data I can make sure to dump when ~90% of the physical memory is in use or ~90 of the maximum of 2GB per application limit is hit.
However, I would like a method for simply recieving how many bytes are actually left before the system will start using the virtual memory, especially as the application will be compiled for both 32bit and 64bit, whereas the 2 GB limit doesnt exist.

How about this function:
int
bytesLeftUntilVMUsed() {
return 0;
}
it should give the correct result in nearly all cases I think ;)

Imagine running Windows 7 in 256Mb of RAM (MS suggest 1GB minimum). That's effectively what you're asking the user to do by wanting to reseve 90% of available RAM.
The real question is: Why do you need so much RAM? What is the 'performance critical' criteria exactly?
Usually, this kind of question implies there's something horribly wrong with your design.
Update:
Using top of the range RAM (DDR3) would give you a theoretical transfer speed of 12GB/s which equates to reading one 32 bit value every clock cycle with some bandwidth to spare. I'm fairly sure that it is not possible to do anything useful with the data coming into the CPU at that speed - instruction processing stalls would interrupt this flow. The extra, unsued bandwidth can be used to page data to/from a hard disk. Using RAID this transfer rate can be quite high (about 1/16th of the RAM bandwidth). So it would be feasible to transfer data to/from the disk and process it without having any degradation of performance - 16 cycles between reads is all it would take (OK, my maths might be a bit wrong here).
But if you throw Windows into the mix, it all goes to pot. Your memory can go away at any moment, your application can be paused arbitrarily and so on. Locking memory to RAM would have adverse affects on the whole system, thus defeating the purpose of locing the memory.
If you explain what you're trying to acheive and the performance critria, there are many people here that will help develop a suitable solution, because if you have to ask about system limits, you really are doing something wrong.

Even if you're able to stop your application from having memory paged out to disk, you'll still run into the problem that the VMM might be paging out other programs to disk and that might potentially affect your performance as well. Not to mention that another application might start up and consume memory that you're currently occupying and thus resulting in some of your applications memory being paged out. How are you planning to deal with that?
There is a way to use non-pageable memory via the non-paged pool but (a) this pool is comparatively small and (b) it's used by device drivers and might only be usable from inside the kernel. It's also not really recommended to use large chunks of it unless you want to make sure your system isn't that stable.
You might want to revisit the design of your application and try to work around the possibility of having memory paged to disk before you either try to write your own VMM or turn a Windows machine into essentially a DOS box with more memory.

The standard solution is to not worry about "virtual" and worry about "dynamic".
The "virtual" part of virtual memory has to be looked at as a hardware function that you can only defeat by writing your own OS.
The dynamic allocation of objects, however, is simply your application program's design.
Statically allocate simple arrays of the objects you'll need. Use those arrays of objects. Increase and decrease the size of those statically allocated arrays until you have performance problems.

Ouch. Non-paged pool (the amount of RAM which cannot be swapped or allocated to processes) is typically 256 MB. That's 12.5% of RAM on a 2GB machine. If another 90% of physical RAM would be allocated to a process, that leaves either -2,5% for all other applications, services, the kernel and drivers. Even if you'd allocate only 85% for your app, that would still leave only 2,5% = 51 MB.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js