Why is CUDA pinned memory so fast? - c++

I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On Linux, the underlying system call for achieving this is mlock. The man page for mlock states that locking pages prevents them from being swapped out:
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;
In my tests, I had a few gigabytes of free memory on my system, so there was never any real risk of the pages being swapped out, yet I still observed the speedup. Can anyone explain what's really going on here? Any insight or info is much appreciated.

The CUDA driver checks whether the memory range is page-locked and takes a different code path accordingly. Locked memory is guaranteed to sit in physical RAM, so the device can fetch it without help from the CPU (DMA, a.k.a. an async copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily stored in RAM (it may be in swap), so the driver has to access every page of the non-locked memory, copy it into a pinned staging buffer, and hand that to the DMA engine (a synchronous, page-by-page copy).
As described here http://forums.nvidia.com/index.php?showtopic=164661
host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.
I can also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc documentation says that the CUDA driver can detect pinned memory:
The driver tracks the virtual memory ranges allocated with this(cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().

CUDA uses DMA to transfer pinned memory to the GPU. Pageable host memory cannot be handed to the DMA engine directly because its pages may reside on disk.
If the memory is not pinned (i.e. page-locked), it's first copied to a page-locked "staging" buffer and then copied to the GPU through DMA.
So by using pinned memory you save the extra copy from pageable host memory into page-locked host memory.
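To make the difference concrete, here is a minimal sketch (not a benchmark; the sizes, buffer names and lack of error checking are purely illustrative) of the two host-to-device paths described above:

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256 * 1024 * 1024;  // 256 MiB, purely illustrative

    float *d_buf = nullptr;
    cudaMalloc(reinterpret_cast<void **>(&d_buf), bytes);

    // Pageable host memory: the driver must stage it page by page through an
    // internal pinned buffer before the DMA engine can see it.
    float *h_pageable = static_cast<float *>(std::malloc(bytes));
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned host memory: the physical pages can be handed to DMA directly,
    // and this buffer is also eligible for true async overlap via cudaMemcpyAsync.
    float *h_pinned = nullptr;
    cudaMallocHost(reinterpret_cast<void **>(&h_pinned), bytes);  // or cudaHostAlloc(..., cudaHostAllocDefault)
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_pinned);
    std::free(h_pageable);
    cudaFree(d_buf);
    return 0;
}
```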

If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.

A further note on copying non-locked pages to locked pages: it can be extremely expensive if the non-locked pages have been swapped out by the OS on a busy system with limited host RAM. A page fault would then be triggered to load each page back into RAM through expensive disk I/O.
Pinning pages can also cause virtual memory thrashing on a system where host RAM is scarce, and if thrashing happens, CPU throughput can degrade badly.

Related

File Based Memory Pool - Is it Possible?

Whenever new / malloc is used, the OS creates (or reuses) a heap memory segment, aligned to the page size, and returns it to the calling process. All these allocations contribute to the process's virtual memory. In 32-bit computing, a process's address space is limited to 4 GB, and the more heap it allocates, the faster its virtual memory footprint grows. Although there are lots of memory managers / memory pools available, all these utilities still end up creating a heap and reusing it efficiently.
mmap (memory mapping), on the other hand, provides the ability to view a file as a stream of memory and lets the program use pointer manipulation directly on the file. But here again, mmap actually allocates the range of addresses in the process's address space, so if we mmap a 3 GB file in its entirety and take a pmap of the process, the total memory consumed by the process shows as >= 3 GB.
My question is: is it possible to have a file-based memory pool [just like mmapping a file] that does not count against the process's address space? I picture something like an in-memory DB backed by a file: fast for reads and writes, supporting pointer manipulation [i.e. get a pointer to a record and store anything there, just as with new / malloc], and able to grow on disk without touching the process's 4 GB virtual limit.
Is it possible? If so, what are some pointers to get me started?
I am not asking for a ready-made solution or links, but to conceptually understand how it can be achieved.
It is generally possible but very complicated. You would have to re-map whenever you wanted to access a different 3 GB segment of your file, which would probably kill performance for scattered access. Pointers would also become much more difficult to work with, as remapping changes the data behind them while leaving the addresses the same.
I have seen the STXXL project, which might be interesting to you, or it might not; I have never used it, so I cannot give you any further advice about it.
What you are looking for is, in principle, a memory-backed file cache. There are many such things in, for example, database implementations (where the whole database is way larger than the memory of the machine, and the application developer probably wants to have a bit of memory left for application stuff). This will involve some sort of indirection - an index, hash or some such to indicate what area of the file you want to access - and using that indirection to determine whether the data is currently in memory or on disk. You would essentially have to replicate what the virtual memory handling of the OS and the processor does, by keeping tables that indicate where in physical memory your "virtual heap" is and, if it's not present in physical memory, reading it in (and if the cache is full, getting rid of some entries - writing them back first if they have been modified).
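To illustrate the indirection idea, here is a toy sketch (the class name, page size, and the absence of eviction and write-back are all placeholders, not a real cache design):

```cpp
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Toy file-backed "heap": pages are read from disk on first access and kept
// in an in-memory index. No eviction, no write-back, no error handling.
class FileHeap {
public:
    explicit FileHeap(const char *path) : file_(std::fopen(path, "rb")) {}
    ~FileHeap() { if (file_) std::fclose(file_); }

    // Return a pointer to the byte at `offset`, faulting the containing page
    // in from disk if it is not cached yet (a user-space analogue of a soft
    // page fault).
    const char *at(size_t offset) {
        size_t page = offset / kPageSize;
        auto it = cache_.find(page);
        if (it == cache_.end()) {
            std::vector<char> buf(kPageSize);
            std::fseek(file_, static_cast<long>(page * kPageSize), SEEK_SET);
            std::fread(buf.data(), 1, kPageSize, file_);
            it = cache_.emplace(page, std::move(buf)).first;
        }
        return it->second.data() + (offset % kPageSize);
    }

private:
    static constexpr size_t kPageSize = 4096;
    std::FILE *file_;
    std::unordered_map<size_t, std::vector<char>> cache_;  // page index -> bytes
};
```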
However, in today's world you most likely have a machine capable of 64-bit addressing, and thus it would be much easier to recompile the application as a 64-bit application and use mmap or similar to access the large file. In this case, even if RAM isn't sufficient, you can access the contents of the file through the virtual memory system, which takes care of all the mapping back and forth between disk and RAM (physical memory).
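For the 64-bit route, a minimal POSIX sketch might look like this (the file name is hypothetical and error handling is reduced to early returns):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("big_data.bin", O_RDONLY);  // hypothetical data file
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // The mapping only reserves address space; physical pages are faulted in
    // (and later evicted) by the kernel as the data is actually touched.
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char *data = static_cast<const char *>(p);

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)  // touch one byte per page
        sum += data[i];

    munmap(p, st.st_size);
    close(fd);
    return static_cast<int>(sum & 0xff);
}
```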

Will memory paging occur if I have plenty of RAM left?

I'm writing a high-performance server application (on Linux) and I'm trying to get a fast critical path. I'm concerned about memory paging and having memory swapped to disk (latency on the order of milliseconds) during my operations.
My question is: if I have a lot of memory on the server (say 16 GB), my memory utilization stays at around 6-10 GB, and I know there are no other processes on the same box, can I be guaranteed not to have page misses after the application has started up and is warm?
This is not guaranteed. Linux's default behavior is to sometimes use RAM to cache files, which can improve performance for some workflows. This means that sometimes memory pages will be swapped out even when the memory isn't all used.
You can use mlock/mlockall to lock the process's pages in memory. See man 2 mlock for more information.
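A minimal sketch of that approach, assuming the process is allowed to lock the required amount of memory (CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK):

```cpp
#include <cstdio>
#include <sys/mman.h>

int main() {
    // MCL_CURRENT locks everything already mapped; MCL_FUTURE also covers
    // allocations made after this call.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        std::perror("mlockall");
        return 1;
    }

    // ... run the latency-sensitive server loop here ...

    munlockall();
    return 0;
}
```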

Is it true that when memory is allocated into a process in Windows, it always triggers a page fault?

I'm trying to wrap my head around the internals of Windows memory management at the OS level.
Is it true that when memory is allocated, it always triggers a page fault behind the scenes? Does this imply that the only way to stop soft page faults is to stop allocating new memory within the process?
Definitions
I define "memory allocation" as any form of malloc, i.e. new, LocalAlloc, VirtualAlloc, HeapAlloc, etc.
I define a "page fault" as the process of mapping memory from the OS pool into the process Working Set, an operation which takes a constant 250us on a high end Xeon.
You need to be very clear about the different things that are going on here. There are two independent parts to the process, committing the memory and paging the memory into the process. Neither of these is directly tied to calling malloc, HeapAlloc or LocalAlloc.
I've tried to break the process down for you below, but the summary is that if you use HeapAlloc or another equivalent function then you'll trigger very few page-faults (at least once your application has initialized and the heap has grown to a stable size) and so shouldn't worry about it too much.
Allocating memory
When you call malloc, HeapAlloc or LocalAlloc the memory allocator will try to find a piece of memory in the heap that's available and large enough. In the majority of cases it will succeed and return the memory to you.
If it cannot find sufficient memory, it will allocate more by calling VirtualAlloc (on Linux this would be sbrk or mmap). This commits the memory. It will then return a small fragment of the new memory to you.
Memory commitment
When you, or the allocator, call VirtualAlloc this will mark a new region of your virtual memory as accessible. This does not trigger a page-fault, nor does it actually assign physical memory to those pages. From the MSDN docs for VirtualAlloc:
Allocates memory charges (from the overall size of memory and the paging files on disk) for the specified reserved memory pages. The function also guarantees that when the caller later initially accesses the memory, the contents will be zero. Actual physical pages are not allocated unless/until the virtual addresses are actually accessed.
Paging memory in
When you access a page of memory that VirtualAlloc has returned to you for the first time, this triggers a soft page-fault. The operating system will find a single page of free physical memory, zero it out and assign it to the virtual page you accessed. This is transparent to you and takes very little time (a single-digit number of microseconds). It's plausible that the operating system may swap this memory out to disk if you stop using it; if it does, then a subsequent access will trigger a hard page-fault.
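A minimal Windows-only sketch of those three stages, with an illustrative size (the commit itself causes no fault; the soft faults happen in the first-touch loop):

```cpp
#include <windows.h>

int main() {
    const SIZE_T bytes = 64 * 1024 * 1024;  // 64 MiB, purely illustrative

    // Reserve and commit: charges the memory against RAM + pagefile, but does
    // not assign physical pages and does not fault.
    char *p = static_cast<char *>(
        VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE));
    if (p == nullptr) return 1;

    SYSTEM_INFO si;
    GetSystemInfo(&si);

    // First touch of each page is where the soft page faults actually occur:
    // each write maps one zeroed physical page into the working set.
    for (SIZE_T i = 0; i < bytes; i += si.dwPageSize)
        p[i] = 1;

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```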
Well, yes, a page fault will bring the CPU back into kernel mode, and the kernel gets a chance to learn that the userspace process wanted a certain memory page. The kernel can then consult its internal memory management bookkeeping data, make a suitably large area of physical memory available, and adjust the processor's page table so that the requested virtual address is mapped. Once that's done, execution passes back to the userspace process, which resumes with a successful allocation.
This is not to be confused with an unexpected page fault, where the userspace process refers to an address that is neither mapped in the page tables nor known to the kernel's memory manager to belong to the process. In that case, the fault is reported to the process as an access violation, which typically terminates it.

CUDA Zero Copy memory considerations

I am trying to figure out if using cudaHostAlloc (or cudaMallocHost?) is appropriate.
I am trying to run a kernel where my input data is more than the amount available on the GPU.
Can I cudaMallocHost more space than there is on the GPU? If not, and let's say I allocate 1/4 of the space that I need (which will fit on the GPU), is there any advantage to using pinned memory?
I would essentially still have to copy from that 1/4-sized buffer into my full-size malloc'd buffer, and that's probably no faster than just using normal cudaMalloc, right?
Is this typical usage scenario correct for using cudaMallocHost:
1. allocate pinned host memory (let's call it "h_p")
2. populate h_p with input data
3. get the device pointer on the GPU for h_p
4. run a kernel using that device pointer to modify the contents of the array
5. use h_p like normal, which now has the modified contents
So no copy has to happen between steps 4 and 5, right?
If that is correct, then I can see the advantage, at least for kernels whose data will fit on the GPU all at once.
Memory transfer is an important factor when it comes to the performance of CUDA applications. cudaMallocHost can do two things:
allocate pinned memory: this is page-locked host memory that the CUDA runtime can track. If host memory allocated this way is involved in cudaMemcpy as either source or destination, the CUDA runtime will be able to perform an optimized memory transfer.
allocate mapped memory: this is also page-locked memory that can be used in kernel code directly as it is mapped to CUDA address space. To do this you have to set the cudaDeviceMapHost flag using cudaSetDeviceFlags before using any other CUDA function. The GPU memory size does not limit the size of mapped host memory.
I'm not sure about the performance of the latter technique. It could allow you to overlap computation and communication very nicely.
If you access the memory in blocks inside your kernel (i.e. you don't need the entire data set but only a section at a time), you could use a multi-buffering scheme with asynchronous memory transfers via cudaMemcpyAsync and multiple buffers on the GPU: compute on one buffer while transferring another buffer to the host and yet another to the device at the same time.
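A rough sketch of that double-buffering idea, assuming a placeholder kernel and chunk size (the pinned staging buffers are what make cudaMemcpyAsync able to overlap at all):

```cpp
#include <cstring>
#include <cuda_runtime.h>

// Placeholder kernel: processes one chunk of the data in place.
__global__ void process(float *chunk, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) chunk[i] *= 2.0f;
}

// Walk a large pageable input in chunks, double-buffered across two streams.
void run(const float *h_input, size_t total, size_t chunk) {
    float *h_staging[2], *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(reinterpret_cast<void **>(&h_staging[i]), chunk * sizeof(float));  // pinned
        cudaMalloc(reinterpret_cast<void **>(&d_buf[i]), chunk * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    for (size_t off = 0, b = 0; off < total; off += chunk, b ^= 1) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaStreamSynchronize(stream[b]);  // wait until buffer b is free again
        std::memcpy(h_staging[b], h_input + off, n * sizeof(float));
        cudaMemcpyAsync(d_buf[b], h_staging[b], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        unsigned blocks = static_cast<unsigned>((n + 255) / 256);
        process<<<blocks, 256, 0, stream[b]>>>(d_buf[b], n);
        // While this stream copies and computes, the other stream's work overlaps.
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_buf[i]);
        cudaFreeHost(h_staging[i]);
    }
}
```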
I believe your assertions about the usage scenario are correct when using the cudaDeviceMapHost type of allocation. You do not have to do an explicit copy, but there will certainly be an implicit transfer that you don't see. There's a chance it overlaps nicely with your computation. Note that you might need to synchronize after the kernel call to make sure the kernel has finished and you really have the modified content in h_p.
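For reference, here is a minimal sketch of the mapped (zero-copy) usage scenario from the question, with a placeholder kernel; note the explicit synchronization before the host reads h_p back:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel: modifies the mapped host array in place.
__global__ void modify(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must come before other CUDA calls

    const size_t n = 1 << 20;
    float *h_p = nullptr, *d_p = nullptr;

    // step 1: allocate pinned, mapped host memory
    cudaHostAlloc(reinterpret_cast<void **>(&h_p), n * sizeof(float),
                  cudaHostAllocMapped);
    // step 2: populate it with input data
    for (size_t i = 0; i < n; ++i) h_p[i] = static_cast<float>(i);
    // step 3: get the device-side pointer for the same memory
    cudaHostGetDevicePointer(reinterpret_cast<void **>(&d_p), h_p, 0);

    // step 4: run the kernel on the mapped pointer
    modify<<<static_cast<unsigned>((n + 255) / 256), 256>>>(d_p, n);
    cudaDeviceSynchronize();  // make sure the kernel has finished

    // step 5: h_p now holds the modified contents, with no explicit cudaMemcpy
    cudaFreeHost(h_p);
    return 0;
}
```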
Using host memory would be orders of magnitude slower than on-device memory. It has both very high latency and very limited throughput: for example, the capacity of PCIe x16 is a mere 8 GB/s, while the device memory bandwidth on a GTX 460 is 108 GB/s.
Neither the CUDA C Programming Guide nor the CUDA Best Practices Guide mentions that the amount allocated by cudaMallocHost can't be bigger than the device memory, so I conclude it's possible.
Data transfers from page-locked memory to the device are faster than normal data transfers, and even faster if using write-combined memory. Also, the memory allocated this way can be mapped into device memory space, eliminating the need to (manually) copy the data at all. It happens automatically as the data is needed, so you should be able to process more data than fits into device memory.
However, system performance (of the host) can greatly suffer, if the page-locked amount makes up a significant part of the host memory.
So when should you use this technique? Simple: if the data needs to be read only once and written only once, use it. It will yield a performance gain, since the data would have to be copied back and forth at some point anyway. But as soon as you need to store intermediate results that don't fit into registers or shared memory, process chunks of your data that fit into device memory allocated with cudaMalloc.
Yes, you can cudaMallocHost more space than there is on the GPU.
Pinned memory can have higher bandwidth, but can decrease host performance. It is very easy to switch between normal host memory, pinned memory, write-combined memory, and even mapped (zero-copy) memory. Why don't you use normal host memory first and compare the performance?
Yes, your usage scenario should work.
Keep in mind that global device memory access is slow, and zero-copy host memory access is even slower. Whether zero-copy is right for you depends entirely on how you use the memory.
Also consider using streams for overlapping data transfer and kernel execution.
This lets the GPU work on chunks of the data while other chunks are still in flight.

Does malloc/new return memory blocks from Cache or RAM?

I wanted to know whether malloc/new returns memory blocks from Cache or RAM.
Thanks in advance.
You are abstracted from all that when living as a process in the OS; you only get memory.
You shouldn't ever worry about it: the OS and the memory hardware manage all that for you, moving things from one level to another, while you still see a single virtual memory layout.
From virtual memory. The OS will take care of bringing the required pages into RAM whenever the process needs them.
malloc and operator new will give you a chunk of address space.
The operating system will back this chunk of address space with some physical storage. The storage could be system memory or a chunk of a page file, the actual storage location can be moved between the various physical storage devices, and this is handled transparently from the application's point of view. In addition, the CPU and memory controller (on board or otherwise) may cache system memory, but this is usually (largely) transparent to the operating system.
As already said, you can't know: cache, RAM and hard disk are all abstracted away behind virtual memory. But if you can measure the access time, you may get an idea of whether RAM or the cache is being accessed; after the first access from RAM, the memory block will be copied into the cache, and subsequent accesses will be served from there.
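A rough sketch of that measurement idea, comparing a working set that fits in cache with one that does not (the sizes and the resulting ratio are machine-dependent and only indicative):

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

// Average per-element access time over `repeats` full passes of the vector.
static double ns_per_element(std::vector<long> &v, int repeats) {
    long sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        sink += std::accumulate(v.begin(), v.end(), 0L);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("(sink=%ld) ", sink);  // keep the reads from being optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           (static_cast<double>(v.size()) * repeats);
}

int main() {
    std::vector<long> small(32 * 1024);         // 256 KiB: fits in L2 on most CPUs
    std::vector<long> large(16 * 1024 * 1024);  // 128 MiB: far larger than any cache

    std::printf("small: %.2f ns/element\n", ns_per_element(small, 256));
    std::printf("large: %.2f ns/element\n", ns_per_element(large, 4));
    return 0;
}
```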
It very much depends. At the start of your program, the memory that the OS gives you will probably not be paged in (at least, not on Linux). However, if you free some stuff, then get a new allocation, the memory could possibly be in the cache.
If there is a constructor which touches the memory, then it will certainly be in the cache after it's constructed.
If you're programming in Java, then you'll likely have a really cool memory allocator and are much more likely to be given memory that's in the cache.
(When I say cache, I mean L2. L1 is unlikely, but you might get lucky, esp. for small programs).
You cannot address the processor cache directly; the processor manages it (almost) transparently. At most you can invalidate or prefetch a cache line, but you work with memory addresses (usually virtual ones, if you're not in real mode), and the processor always feeds itself data and instructions from its internal caches (if the data is not already present there, it has to be fetched first).
Read this article for further info: http://en.wikipedia.org/wiki/Memory_hierarchy
First, the memory allocated to an application is virtual memory, whose addresses live in the virtual address space. Second, caches such as L1 and L2 are not allocated to you; they are managed by the hardware and the system. In fact, if caches were allocated directly to you, it would be hard for the system to schedule tasks.