RDMA is an efficient way to bypass needless data copies between the application and the OS kernel. mmap is an efficient way to treat a large file as if it were just an array of bytes.
I am working with MPI over InfiniBand, which supports RDMA network operations between processes. Each MPI process has a very large file to share with the others.
Can each MPI process create an mmap region over its large file and share it with the others? I want each process to be able to read any other process's file as if it were reading that process's memory via RDMA (MPI one-sided communication).
As far as I know, when an application invokes an RDMA operation, it passes the virtual memory address directly to the NIC, and the NIC handles the translation from that virtual address to the physical address. If the RDMA driver pins the pages of interest before it issues the request to the NIC, I think it will work. Does anyone have experience with this? :D
thanks
Yes, you can register a region of memory mapped with mmap() for RDMA. See the docs for ibv_reg_mr (http://www.rdmamojo.com/2012/09/07/ibv_reg_mr/), which say:
Every memory address in the virtual space of the calling process can be registered, including, but not limited to:
Local memory (either variable or array)
Global memory (either variable or array)
Dynamically allocated memory (using malloc() or mmap())
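For example, here is a minimal sketch (not a tested implementation) of mapping a file and registering it for one-sided access. It assumes a protection domain pd already created with ibv_alloc_pd(); the helper name register_file is made up for illustration, and error handling is abbreviated:

/* Map a file and register it for remote read/write via ibv_reg_mr(). */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_file(struct ibv_pd *pd, const char *path, size_t *len)
{
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);
    *len = st.st_size;

    /* MAP_SHARED so that remote RDMA writes reach the file. */
    void *addr = mmap(NULL, *len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* ibv_reg_mr pins the pages and returns the lkey/rkey that peers
     * need for one-sided READ/WRITE operations. */
    return ibv_reg_mr(pd, addr, *len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}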
Related
I need to manage memory that's in a separate memory space. Another program has a large slab of contiguous memory that my code cannot directly access, and notifies my code of its size during initialization. The other program will ask my code to "allocate" X bytes from the slab, and will later notify my code to deallocate the allocated blocks. The allocations and deallocations will be more or less unpredictable, much like the use of regular malloc and free. It's up to my code to manage the dynamic memory for the other process. The intended use has to do with device memory on a GPU, but I would rather not make the question specific to that. I'd like an answer that's generic, maybe even enough that I could theoretically use it as a back-end for a network API for managing virtual memory on a remote machine.
The basic functionality I need is to be able to initialize the manager with the slab size, perform the equivalent of malloc() and free(), and get a few stats like remaining available memory and maybe the maximum allocatable size.
Obviously, I can't use malloc() and free() 'abstractly'. I also don't want the management of my actual process memory to interfere with the management of my abstract slab.
What would be the best way to go about this? And, more specifically, does the standard library or Boost have this kind of a facility?
Notes:
Allocation strategy is not extremely important at this point. I mean, maybe it is, but first I need to figure out how I'm going to go about this business.
I'm going to be using this for allocating between tens and several thousand buffers per second, with different sizes; some as small as several bytes, some as large as gigabytes. There will be no buildup of allocated space over a long period of time.
a back-end for a network API for managing virtual memory on a remote machine
A networked system should not be concerned with the memory management of a remote node. The remote node should be free to choose whether to serve a request from its fastest but scarcest memory, from larger but slower storage, or from a hybrid of both, depending on its own heuristics, other requests happening concurrently, prioritization/security policies set by the network admin, and so on.
Thinking in terms of malloc and free will limit what a remote server can do to serve the requests efficiently.
Note that, strictly speaking, malloc itself only manages "theoretical"/"virtual" memory. The OS kernel can page memory allocated with malloc in and out of swap space and assign or move its physical address in the MMU. The "address" returned by malloc is just a virtual address that doesn't necessarily correspond to a fixed physical address.
Unlike in a networked system, when managing hardware you usually do want to dictate very closely what the hardware should do at each step, because, presumably, the hardware is connected to a single system, so any contention is only local contention, and throughput considerations usually trump everything else. Managing a remote resource on a networked system and managing local hardware are similar in nature but quite different.
If you're interested in designs for networked shared memory, you might want to check out in-memory databases (e.g. Redis) or networked file systems. Networked shared memory can usually afford to use user-friendly string keys to point to memory slabs; hardware shared memory usually wants to keep things simple and fast with plain numeric handles.
Both networked systems and hardware need to deal with splitting a single resource among multiple independent processes securely. In many popular operating systems, the code responsible for managing hardware allocation lives in the kernel; it's quite rare for a userspace program to be responsible for managing hardware on a monolithic OS. Maybe you should be writing a device driver? Many GPGPU devices now have an MMU.
Certainly, there's nothing preventing you from writing a gpu_malloc()/gpu_free(). It theoretically should be possible to mmap the remote system's address space into the process's virtual memory address space so that programs could just use the remote memory just as any other memory.
I'm going to be using this for allocating between tens and several thousand buffers per second, with different sizes; some as small as several bytes, some as large as gigabytes.
Generally, you don't want to have thousands of small allocations. System calls related to memory management can be expensive if they involve remote systems, so it might be necessary to have the client program allocate large slabs and do its own suballocation (malloc does this: small mallocs usually cost one system call that allocates more memory than requested, and malloc then suballocates the slab).
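As an illustration of that suballocation idea, here is a minimal sketch of a first-fit allocator that hands out offsets into an abstract slab rather than real pointers. All names are hypothetical, and free-block coalescing and the stats the question asks for are omitted:

/* First-fit suballocator over an abstract slab; the bookkeeping lives
 * in local memory, the managed slab itself is never touched. */
#include <stddef.h>
#include <stdlib.h>

typedef struct block {
    size_t off, len;           /* offset and length within the slab */
    int free;
    struct block *next;
} block;

static block *head;

void slab_init(size_t slab_size)
{
    head = malloc(sizeof *head);
    *head = (block){ .off = 0, .len = slab_size, .free = 1, .next = NULL };
}

/* Returns an offset into the slab, or (size_t)-1 on failure. */
size_t slab_alloc(size_t n)
{
    for (block *b = head; b; b = b->next) {
        if (b->free && b->len >= n) {
            if (b->len > n) {                  /* split the block */
                block *rest = malloc(sizeof *rest);
                *rest = (block){ .off = b->off + n, .len = b->len - n,
                                 .free = 1, .next = b->next };
                b->next = rest;
                b->len = n;
            }
            b->free = 0;
            return b->off;
        }
    }
    return (size_t)-1;
}

void slab_free(size_t off)
{
    for (block *b = head; b; b = b->next)
        if (b->off == off) { b->free = 1; return; }
    /* Coalescing of adjacent free blocks is omitted for brevity. */
}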
I am interested in accessing network packets via "bus-mastering" in a C++ application on Linux. I have a few questions relating to this overall topic:
1) How will I know which memory address range the "bus-mastering"-enabled network card is writing the data to, and would this be kernel or user space?
2) If the answer to #1 is "kernel space", how could I change the card so that it writes to memory in user space?
3a) How can I access this particular user-space memory area from C++?
3b) I understand you cannot just start accessing the memory areas of other processes from one application, only those explicitly "shared" - so how do I ensure the memory area written to directly by the network card is explicitly shared?
4) How do I know whether a network card implements "bus-mastering"?
I have come across the term PACKET_MMAP - is this going to be what I need?
If you mmap a region of memory, and give the address of that to the OS, the OS can lock that region (so that it doesn't become swapped out) and get the physical address of the memory.
It is not used for exactly that purpose, but the code in drivers/xen/privcmd.c illustrates the technique: the function mmap_mfn_range is called from privcmd_ioctl_mmap (indirectly, via traverse_map), and it in turn calls remap_area_mfn_pte_fn from xen_remap_domain_mfn_range.
So, if you do something along those lines in the driver, such that the pages are locked in memory and belong to the application, you can program the physical address(es) of the mmap'd region into the network card's hardware and get the data delivered directly to the user-mode memory that was mmap'd by the user code.
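As for the PACKET_MMAP mentioned in the question: yes, that is the stock kernel facility for a shared kernel/user packet ring (the NIC still DMAs into kernel buffers first, so it is not quite the direct-to-user path described above). A minimal sketch of setting up a receive ring, with error handling omitted (requires CAP_NET_RAW):

/* PACKET_MMAP receive ring (TPACKET_V1 defaults). */
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    struct tpacket_req req = {
        .tp_block_size = 4096,     /* one page per block */
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,     /* two frames per block */
        .tp_frame_nr   = 128,      /* block_nr * frames per block */
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof req);

    /* The ring is shared between kernel and user space: the kernel
     * writes packet frames into it; we poll and read them without a
     * syscall per packet. */
    void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    (void)ring;
    return 0;
}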
I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On Linux, the underlying system call for achieving this is mlock. The man page for mlock states that locking the page prevents it from being swapped out:
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;
In my tests, I had a few gigs of free memory on my system, so there was never any risk that the memory pages could have been swapped out, yet I still observed the speedup. Can anyone explain what's really going on here? Any insight or info is much appreciated.
The CUDA driver checks whether the memory range is locked and then uses a different code path. Locked memory is guaranteed to be in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, a.k.a. async copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access, and it may be stored not only in RAM (e.g. it can be in swap), so the driver needs to access every page of the non-locked memory, copy it into a pinned buffer, and pass that to the DMA engine (a synchronous, page-by-page copy).
As described here http://forums.nvidia.com/index.php?showtopic=164661
host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.
I can also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc documentation says that the CUDA driver can detect pinned memory:
The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().
CUDA uses DMA to transfer pinned memory to the GPU. Pageable host memory cannot be used with DMA because its pages may reside on disk.
If the memory is not pinned (i.e. page-locked), it's first copied to a page-locked "staging" buffer and then copied to GPU through DMA.
So using the pinned memory you save the time to copy from pageable host memory to page-locked host memory.
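To make the two code paths concrete, here is a minimal sketch (illustrative only; no timing or error checking) contrasting a pageable and a pinned host buffer:

/* Pageable vs. pinned host buffers for host-to-device copies. */
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 64 << 20;                 /* 64 MiB */
    void *d_buf, *pageable, *pinned;

    cudaMalloc(&d_buf, n);

    /* Pageable: the driver must stage this through an internal
     * page-locked buffer, chunk by chunk. */
    pageable = malloc(n);
    cudaMemcpy(d_buf, pageable, n, cudaMemcpyHostToDevice);

    /* Pinned: page-locked, so the GPU's DMA engine reads it directly;
     * also required for a truly asynchronous cudaMemcpyAsync. */
    cudaHostAlloc(&pinned, n, cudaHostAllocDefault);
    cudaMemcpyAsync(d_buf, pinned, n, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}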
If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.
A verbose note on copying non-locked pages to locked pages: it can be extremely expensive if the non-locked pages have been swapped out by the OS on a busy system with limited CPU RAM. A page fault will then be triggered to load the pages back into RAM through expensive disk I/O operations.
Pinning pages can also cause virtual memory thrashing on a system where CPU RAM is precious. If thrashing happens, the throughput of CPU can be degraded a lot.
Platform - Linux, Arch - ARM
Programming lang - C/C++
Objective - map a regular (let's say text) file to a pre-known location (physical address) in RAM and pass that physical address to some other application. The size of the block I map at a time is 128K.
The way I am trying to go about it is:
The user space process issues an ioctl call to ask a device driver to get a chunk of memory (RAM), calculate its physical address, and return it to user space.
The user space process then needs to map the file to that physical address space.
I am not sure how to go about it. Any help is appreciated.
The issue with calling mmap on the file and then calculating the physical address is that pages are not in memory until someone accesses them, and the physical pages allocated might not be contiguous.
The process which will actually access the file is a third-party vendor application. That application demands that, once we pass it the physical address, the file contents be present in contiguous memory.
How I am doing it right now:
The user process calls mmap on the device.
The device driver does a kmalloc, calculates the starting physical address, and maps the VMA to that physical address.
The user process then does a read on the file and copies the data into the address space obtained from the mmap.
Issue - the file contents exist in two locations in RAM: once where read() brings them in from disk, and again where I copy them into the buffer obtained via mmap, with the corresponding copying overhead.
In an ideal world I would like to load the file directly from the disk to a known/predefined location.
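For reference, here is a rough sketch of the driver side of the current approach: back the user VMA with a kmalloc'd, physically contiguous 128K chunk via remap_pfn_range(). This is illustrative only; module boilerplate, locking, and error handling are omitted, and the names are made up:

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/io.h>

#define CHUNK_SIZE (128 * 1024)

static void *chunk;     /* physically contiguous kmalloc'd buffer */

static int chunk_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long pfn = virt_to_phys(chunk) >> PAGE_SHIFT;
    unsigned long len = vma->vm_end - vma->vm_start;

    if (len > CHUNK_SIZE)
        return -EINVAL;

    /* Map the buffer's physical pages into the user VMA. */
    return remap_pfn_range(vma, vma->vm_start, pfn, len,
                           vma->vm_page_prot);
}

static int __init chunk_init(void)
{
    chunk = kmalloc(CHUNK_SIZE, GFP_KERNEL);
    return chunk ? 0 : -ENOMEM;
}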
"Mapping a file" implies using virtual addresses rather than physical, so that's not going to do what you want.
If you want to put the file contents into a contiguous block of physical memory, just use open() and read() once you have obtained the contiguous buffer.
Perhaps something like madvise() with MADV_SEQUENTIAL advice argument could help?
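A sketch of that open()/read() suggestion, assuming the contiguous buffer has already been mapped into the process at buf (load_file is a made-up helper):

/* Read a file straight into an already-mapped contiguous buffer. */
#include <fcntl.h>
#include <unistd.h>

ssize_t load_file(const char *path, void *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t total = 0;
    ssize_t r = 0;
    while (total < cap &&
           (r = read(fd, (char *)buf + total, cap - total)) > 0)
        total += r;

    close(fd);
    return (r < 0) ? -1 : (ssize_t)total;
}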
Some things to consider:
How large is the file you're going to be mapping?
That might affect your ability to get a contiguous block of RAM, even if you were to take the kernel driver based approach.
For a kernel driver based approach: well-behaved drivers typically should not kmalloc() more than 32KB when grabbing a contiguous block of memory, and you typically can't kmalloc() more than 2MB at all (I've tried this :)). Is that going to be suitable for your needs?
If you need a really large chunk of memory, something like the kernel's alloc_bootmem() function could help, but it only works for static "built-in" drivers, not dynamically loadable ones.
Is there any way you can rework your design so that a large contiguous block of mapped memory isn't necessary?
In my C++ program (on Windows), I'm allocating a block of memory and can make sure it stays locked (unswapped and contiguous) in physical memory (e.g. using VirtualAllocEx(), MapUserPhysicalPages(), etc.).
In the context of my process, I can get the VIRTUAL memory address of that block,
but I need to find out the PHYSICAL memory address of it in order to pass it to some external device.
1. Is there any way I can translate the virtual address to the physical one within my program, in USER mode?
2. If not, I can find out this virtual to physical mapping only in KERNEL mode. I guess it means I have to write a driver to do it...? Do you know of any readily available driver/DLL/API which I can use, that my application (program) will interface with to do the translation?
3. In case I'll have to write the driver myself, how do I do this translation? Which functions do I use? Is it MmGetPhysicalAddress()? How do I use it?
4. Also, if I understand correctly, MmGetPhysicalAddress() returns the physical address of a virtual base address that is in the context of the calling process. But if the calling process is the driver, and I'm using my application to call the driver for that function, I'm changing contexts and am no longer in the context of the app when MmGetPhysicalAddress is called... so how do I translate a virtual address in the application's (user-mode) memory space, not the driver's?
Any answers, tips and code excerpts will be much appreciated!!
Thanks
In my C++ program (on Windows), I'm allocating a block of memory and can make sure it stays locked (unswapped and contiguous) in physical memory (e.g. using VirtualAllocEx(), MapUserPhysicalPages(), etc.).
No, you can't really ensure that it stays locked. What if your process crashes, or exits early? What if the user kills it? That memory will be reused for something else, and if your device is still doing DMA, that will eventually result in data loss/corruption or a bugcheck (BSOD).
Also, MapUserPhysicalPages is part of Windows AWE (Address Windowing Extensions), which is for handling more than 4 GB of RAM on 32-bit versions of Windows Server. I don't think it was intended to be used to hack up user-mode DMA.
1. Is there any way I can translate the virtual address to the physical one within my program, in USER mode?
There are drivers that let you do this, but you cannot program DMA from user mode on Windows and still have a stable and secure system. Letting a process that runs as a limited user account read/write physical memory allows that process to own the system. If this is for a one-off system or a prototype, this is probably acceptable, but if you expect other people (particularly paying customers) to use your software and your device, you should write a driver.
2. If not, I can find out this virtual to physical mapping only in KERNEL mode. I guess it means I have to write a driver to do it...?
That is the recommended way to approach this problem.
Do you know of any readily available driver/DLL/API which I can use, that my application (program) will interface with to do the translation?
You can use an MDL (Memory Descriptor List) to lock down arbitrary memory, including memory buffers owned by a user-mode process, and translate its virtual addresses into physical addresses. You can also have Windows temporarily create an MDL for the buffer passed into a call to DeviceIoControl by using METHOD_IN_DIRECT or METHOD_OUT_DIRECT.
Note that contiguous pages in the virtual address space are almost never contiguous in the physical address space. Hopefully your device is designed to handle that.
3. In case I'll have to write the driver myself, how do I do this translation? Which functions do I use? Is it MmGetPhysicalAddress()? How do I use it?
There's a lot more to writing a driver than just calling a few APIs. If you're going to write a driver, I would recommend reading as much relevant material as you can from MSDN and OSR. Also, look at the examples in the Windows Driver Kit.
4. Also, if I understand correctly, MmGetPhysicalAddress() returns the physical address of a virtual base address that is in the context of the calling process. But if the calling process is the driver, and I'm using my application to call the driver for that function, I'm changing contexts and am no longer in the context of the app when MmGetPhysicalAddress is called... so how do I translate a virtual address in the application's (user-mode) memory space, not the driver's?
Drivers are not processes. A driver can run in the context of any process, as well as various elevated contexts (interrupt handlers and DPCs).
You have a virtually contiguous buffer in your application. That range of virtual memory is, as you noted, only available in the context of your application, and some of it may be paged out at any time. So, in order to access the memory from a device (which is to say, do DMA), you need to both lock it down and get a description of it that can be passed to the device.
You can get a description of the buffer called an MDL, or Memory Descriptor List, by sending an IOCTL (via the DeviceControl function) to your driver using METHOD_IN_DIRECT or METHOD_OUT_DIRECT. See the following page for a discussion of defining IOCTLs.
http://msdn.microsoft.com/en-us/library/ms795909.aspx
Now that you have a description of the buffer in a driver for your device, you can lock it down so that the buffer remains in memory for the entire period that your device may act on it. Look up MmProbeAndLockPages on MSDN.
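As a rough WDM-style illustration (not a complete driver; LockUserBuffer is a hypothetical helper), locking down a user buffer with an MDL might look like this:

#include <ntddk.h>

NTSTATUS LockUserBuffer(PVOID UserVa, SIZE_T Length, PMDL *OutMdl)
{
    PMDL mdl = IoAllocateMdl(UserVa, (ULONG)Length, FALSE, FALSE, NULL);
    if (!mdl)
        return STATUS_INSUFFICIENT_RESOURCES;

    __try {
        /* Faults the pages in and pins them until MmUnlockPages(). */
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    *OutMdl = mdl;
    return STATUS_SUCCESS;
}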
Your device may or may not be able to read or write all of the memory in the buffer. The device may only support 32-bit DMA while the machine has more than 4GB of RAM, or you may be dealing with a machine that has an IOMMU, a GART, or some other address translation technology. To accommodate this, use the various DMA APIs to get a set of logical addresses that are good for use by your device. In many cases, these logical addresses will be equivalent to the physical addresses that your question originally asked about, but not always.
Which DMA API you use depends on whether your device can handle scatter/gather lists and such. Your driver, in its setup code, will call IoGetDmaAdapter and use some of the functions returned by it.
Typically, you'll be interested in GetScatterGatherList and PutScatterGatherList. You supply a function (ExecutionRoutine) which actually programs your hardware to do the transfer.
There's a lot of details involved. Good Luck.
You cannot access the page tables from user space; they are mapped only in the kernel.
If you are in the kernel, you can simply inspect the value of CR3 to locate the base page table address and then begin your resolution.
This blog series has a wonderful explanation of how to do this. You do not need any OS facility/API to resolve virtual<->physical addresses.
Virtual Address: f9a10054

1: kd> .formats 0xf9a10054
Binary: 11111001 10100001 00000000 01010100

Page Directory Pointer Index (PDPI):  11               index into the 1st table (Page Directory Pointer Table)
Page Directory Index (PDI):           111001 101       index into the 2nd table (Page Directory Table)
Page Table Index (PTI):               00001 0000       index into the 3rd table (Page Table)
Byte Index:                           0000 01010100    0x054, the offset into the physical memory page
In his example, he uses WinDbg; !dq is a physical memory read.
1) No
2) Yes, you have to write a driver. Best would be either a virtual driver, or change the driver for the special-external device.
3) This gets very confusing here. MmGetPhysicalAddress should be the function you are looking for, but I really don't know how the physical address maps to a particular bank/chip/etc. of the physical memory.
4) You cannot use paged memory, because that gets relocated. You can lock paged memory with MmProbeAndLockPages on an MDL you can build on memory passed in from the user mode calling context. But it is better to allocate non-paged memory and hand that to your user mode application.
// NonPagedPool memory is never paged out or relocated.
PVOID p = ExAllocatePoolWithTag( NonPagedPool, numberOfBytes, POOL_TAG );
PHYSICAL_ADDRESS realAddr = MmGetPhysicalAddress( p );
// use realAddr
You really shouldn't be doing stuff like this in usermode; as Christopher says, you need to lock the pages so that mm doesn't decide to page out your backing memory while a device is using it, which would end up corrupting random memory pages.
But if the calling process is the driver, and I'm using my application to call the driver for that function, I'm changing contexts and am no longer in the context of the app when MmGetPhysicalAddress is called
Drivers don't have a context like user-mode apps do; if you're calling into a driver via an IOCTL or something, you will usually (but it's not guaranteed!) be in the calling user thread's context. But really, this doesn't matter for what you're asking, because kernel-mode memory (anything above 0x80000000) has the same mapping no matter what context you're in, and you'd end up allocating memory on the kernel side anyway. But again: write a proper driver. Use WDF (http://www.microsoft.com/whdc/driver/wdf/default.mspx), and it will make writing a correct driver much easier (though still pretty tricky; Windows driver writing is not easy).
EDIT: Just thought I'd throw out a few book references to help you out, you should definitely (even if you don't pursue writing the driver) read Windows Internals by Russinovich and Solomon (http://www.amazon.com/Microsoft-Windows-Internals-4th-Server/dp/0735619174/ref=pd_bbs_sr_2?ie=UTF8&s=books&qid=1229284688&sr=8-2); Programming the Microsoft Windows Driver Model is good too (http://www.amazon.com/Programming-Microsoft-Windows-Driver-Second/dp/0735618038/ref=sr_1_1?ie=UTF8&s=books&qid=1229284726&sr=1-1)
Wait, there is more. For the privilege of running on your customer's 64-bit Vista, you get to spend more time and money having your kernel-mode driver signed by Microsoft.