I am using a board with integrated GPU and CPU memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a CUDA kernel. After doing some digging, it sounds like I want to use some form of zero copy by calling cudaHostGetDevicePointer. What I am unsure of is the allocation of the memory. Does the pointer have to have been created using cudaHostAlloc? I do not want to have to rewrite Blitz++ to use cudaHostAlloc if I don't have to.
My code currently works, but copies the matrix data every time, which is not needed on boards with integrated memory.
The pointer has to be created (i.e. allocated) with cudaHostAlloc, even on integrated systems like Jetson.
The reason for this is that the GPU requires (zero-copy) memory to be pinned, i.e. removed from the host demand-paging system. Ordinary allocations are subject to demand-paging, and may not be used as zero-copy i.e. mapped memory for the GPU.
Can someone give a clear explanation of how the new and delete keywords would behave if called from __device__ or __global__ code in CUDA 4.2?
Where does the memory get allocated? If it's on the device, is it local or global?
In terms of context: I am trying to create neural networks on the GPU, and I want a linked representation (like a linked list, but each neuron stores a linked list of connections that hold weights and pointers to the other neurons). I know I could allocate using cudaMalloc before the kernel launch, but I want the kernel to control how and when the networks are created.
Thanks!
C++ new and delete operate on device heap memory. The device allows for a portion of the global (i.e. on-board) memory to be allocated in this fashion. new and delete work in a similar fashion to device malloc and free.
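You can adjust the amount of device global memory available for the heap using a runtime API call (cudaDeviceSetLimit with cudaLimitMallocHeapSize), made before launching any kernel that uses the device heap.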
You may also be interested in the C++ new/delete sample code.
CC 2.0 or greater is required for these capabilities.
I have a C++ project which uses a DLL written in C++ with CUDA.
Right now I'm passing 2 pointers from the main project to the DLL. Inside the DLL, the arrays are copied to device memory, some calculation is done with them, and then the arrays are copied back to the host.
I heard that data transfer performs better when copies are overlapped with computation. But how can I do that in this case? cudaMemcpyAsync requires pinned memory to be truly asynchronous, and the passed pointers are not pinned, right?
My temporary solution is to use memcpy to copy the passed arrays into pinned arrays, then use streams to overlap the transfers, and finally use memcpy again to copy from the pinned arrays back to the passed arrays. The extra CPU copying is clearly not a good approach, I think.
Also, can we do something like passing pinned memory arrays from the main project to the DLL when both use CUDA?
Many thanks in advance.
Memory allocated by the standard C/C++ allocators (malloc and new) can be converted to page-locked memory with the CUDA runtime function cudaHostRegister; once pinned, it can be used to overlap asynchronous memory copies between host and device. Don't forget to unpin memory that was pinned this way: use cudaHostUnregister. If the memory is not unpinned, undesired results may follow. For example, a function may try to pin memory that is already pinned, or pinned memory may be freed with free or delete, which is undefined behavior.
I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be a performance improvement for your code from doing the allocations once, at the beginning of your program, either using the static (i.e. __device__) method or via the runtime API (i.e. cudaMalloc, etc.). This avoids spending time on dynamic memory allocation in performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
First of all, you need to think about coalescing the memory access. You didn't mention which GPU you are using. On the latest GPUs, a coalesced memory read gives the same performance as constant memory. So always make your memory reads and writes as coalesced as you can.
Another option is texture memory (if the data size fits into it). Texture memory has a caching mechanism, and it was previously used in cases where global memory reads were non-coalesced. But the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives better performance than dynamically allocated global memory, since the coalescing issue still exists. Also, global memory declared in global (variable) scope is not possible in the case of CUDA global memory. The variables that can be declared globally (in the program) are constant memory variables and textures, which do not need to be passed to kernels as arguments.
For memory optimizations, please see the memory optimization section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations
If I have a special hardware unit with some storage in it that is connected to the computer and is memory mapped, so that its storage is accessible in the address range 0x55500000 – 0x555fffff, how can I interface this hardware unit to my C++ program so that dynamic memory is allocated in this hardware unit, not in my computer’s memory?
I need to implement a class which has the following function in it.
void * allocMemoryInMyHardware(int numberOfBytesToAllocate);
which returns a pointer to the allocated memory chunk, or null if unable to allocate.
You need to write your own allocator. Search the internet for sample code and tweak it. If your requirements are simple, a basic allocator can be written from scratch in 2-4 hours. This approach will work if your platform does not have virtual memory management and your code can access that address range directly. Otherwise you need to dive into driver development on your platform.
A typical strategy is to add a header to each allocated unit and keep a doubly linked list of free memory areas. NT heaps work in a similar way.
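As a rough illustration of the header-plus-linked-list strategy described above, here is a minimal sketch. The class name `RegionAllocator` and its members are hypothetical, and it simplifies the scheme by keeping one doubly linked list of all blocks (free and used) in address order instead of a free-only list. It assumes the region is directly addressable from your code and that the base address is suitably aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical first-fit allocator over a caller-supplied address range.
// Every block starts with a Header; blocks are linked in address order.
class RegionAllocator {
    struct alignas(std::max_align_t) Header {
        std::size_t size;  // payload bytes that follow this header
        bool free;         // true if the payload is available
        Header* prev;      // previous block in address order
        Header* next;      // next block in address order
    };

    Header* head_;
    std::size_t total_;

    static constexpr std::size_t kAlign = alignof(std::max_align_t);

    // Merge block b with the block that follows it.
    static void absorbNext(Header* b) {
        Header* n = b->next;
        b->size += sizeof(Header) + n->size;
        b->next = n->next;
        if (n->next) n->next->prev = b;
    }

public:
    RegionAllocator(void* base, std::size_t bytes) : total_(bytes) {
        assert(bytes > sizeof(Header));
        assert(reinterpret_cast<std::uintptr_t>(base) % kAlign == 0);
        // One big free block covering the whole region.
        head_ = new (base) Header{bytes - sizeof(Header), true, nullptr, nullptr};
    }

    // Returns a pointer into the region, or nullptr if nothing fits.
    void* allocate(std::size_t bytes) {
        if (bytes == 0 || bytes > total_) return nullptr;
        const std::size_t need = (bytes + kAlign - 1) / kAlign * kAlign;
        for (Header* b = head_; b != nullptr; b = b->next) {
            if (!b->free || b->size < need) continue;
            // Split when the remainder can hold a header plus some payload.
            if (b->size >= need + sizeof(Header) + kAlign) {
                char* payload = reinterpret_cast<char*>(b + 1);
                Header* rest = new (payload + need)
                    Header{b->size - need - sizeof(Header), true, b, b->next};
                if (b->next) b->next->prev = rest;
                b->next = rest;
                b->size = need;
            }
            b->free = false;
            return b + 1;  // payload starts right after the header
        }
        return nullptr;
    }

    // Marks the block free and merges it with free neighbours.
    void deallocate(void* p) {
        if (p == nullptr) return;
        Header* b = static_cast<Header*>(p) - 1;
        b->free = true;
        if (b->next && b->next->free) absorbNext(b);
        if (b->prev && b->prev->free) absorbNext(b->prev);
    }
};
```

On the hardware from the question, the allocator would presumably be constructed once over the mapped range (base 0x55500000, size 0x100000), and allocMemoryInMyHardware could forward to allocate. For trying the sketch out on a desktop, an ordinary aligned byte array can stand in for the device region.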
I think you can use the placement new syntax for this purpose. Using this, you can tell where objects shall be constructed:
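#include <new>                          // required for placement new
alignas(int) char memory[sizeof(int)];  // raw storage, correctly aligned for an int
int* i = new (memory) int(42);          // construct the int inside that storage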
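For types with a non-trivial destructor, placement new has a counterpart: the object must be destroyed explicitly, because the storage is not owned by the normal heap. The sketch below shows that pairing. The helpers `constructAt` and `destroyAt` and the `Sample` type are hypothetical names introduced only for illustration, and the caller is assumed to supply storage that is large enough and correctly aligned.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Hypothetical example type with a non-trivial destructor.
struct Sample {
    int id;
    std::string name;
    Sample(int i, std::string n) : id(i), name(std::move(n)) {}
};

// Construct a T inside caller-provided storage (no allocation happens here).
template <class T, class... Args>
T* constructAt(void* storage, Args&&... args) {
    return ::new (storage) T(std::forward<Args>(args)...);
}

// Run the destructor only; the storage itself stays with its owner.
template <class T>
void destroyAt(T* object) {
    object->~T();
}
```

The storage passed to constructAt could come from a custom allocator for the memory-mapped range. Note that placement new only decides where an object is constructed; it does not manage which parts of the range are in use.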
I am new to OpenGL.
My question is: what does glMapBuffer do behind the scenes? Does it allocate new host memory, copy the GL object data into it, and return the pointer?
Is it guaranteed to return the same pointer on subsequent calls to this method? Of course, with unmapping in between.
Like so often, the answer is "It depends". In some situations glMapBuffer will indeed allocate memory through malloc, copy the data there for your use and glUnmapBuffer releases it.
However, the common way to implement glMapBuffer is through memory mapping. If you don't know what this is, take a look at the documentation of the syscall mmap (*nix systems like Linux, MacOS X) or CreateFileMapping (Windows). What happens there is kind of interesting: modern operating systems manage each running process's address space in virtual memory. Every time some "memory" is accessed, the OS's memory management uses the accessed address as an index into a translation table, to redirect the operation to system RAM, swap space, etc. (of course the details are quite involved; memory management is one of the more difficult things in a kernel to understand). A driver can install its very own access handler. So a process can mmap something managed by a driver into its address space, and every time it performs an access on this, the driver's mmap handler gets called. This allows the driver to map some GPU memory (through DMA) into the process's address space and do the necessary bookkeeping. In this situation glMapBuffer will create such a memory mapping, and the pointer you receive will point to some address space in your process that has been mapped into DMA memory reserved for the GPU.
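To make the memory-mapping idea concrete, here is a small POSIX-only sketch that maps a temporary file into the process and modifies it through an ordinary pointer. It only illustrates the operating system mechanism, not what any particular OpenGL driver does. The function name `upperCaseFirstByteViaMmap` and the temporary file path are hypothetical, and a writable /tmp is assumed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Writes `initial` to a temporary file, maps the file into memory, upper-cases
// the first byte through the mapped pointer, and reads the file back with
// pread. Returns the file contents, or an empty string if any step fails.
std::string upperCaseFirstByteViaMmap(const std::string& initial) {
    assert(!initial.empty());
    char path[] = "/tmp/mmap_demo_XXXXXX";  // hypothetical location
    int fd = mkstemp(path);
    if (fd < 0) return {};
    unlink(path);  // the file stays alive until fd is closed

    std::string result;
    const std::size_t len = initial.size();
    if (write(fd, initial.data(), len) == static_cast<ssize_t>(len)) {
        // MAP_SHARED: stores through the pointer go to the file itself.
        void* addr = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr != MAP_FAILED) {
            char* bytes = static_cast<char*>(addr);
            bytes[0] = static_cast<char>(
                std::toupper(static_cast<unsigned char>(bytes[0])));
            msync(addr, len, MS_SYNC);  // flush the change before reading back
            munmap(addr, len);

            std::string buffer(len, '\0');
            if (pread(fd, &buffer[0], len, 0) == static_cast<ssize_t>(len)) {
                result = buffer;
            }
        }
    }
    close(fd);
    return result;
}
```

The store through `bytes[0]` looks like a plain memory write, yet it ends up in the file: the same kind of redirection lets a driver back a mapped pointer with GPU-accessible memory.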
It could do any of those. The specific memory behavior of glMapBuffer is implementation-defined.
If you map a buffer for reading, it might give you a pointer to that buffer object's memory. Or it might allocate memory, copy the data into it, and give you a pointer to the copy. If you map a buffer for writing, it might give you a pointer to that buffer object's memory. Or it might allocate memory and hand you a pointer to that.
There is no way to know; you can't rely on it doing either one. It ought to give you better performance than glBufferSubData, assuming you can generate your data (from a file or from elsewhere) directly into the memory you get back from glMapBuffer. The worst case would be performance equal to glBufferSubData.
Is it guaranteed to return the same pointer on subsequent calls to this method?
Absolutely not.