Can someone give a clear explanation of how the new and delete operators behave when called from __device__ or __global__ code in CUDA 4.2?
Where does the memory get allocated? If it's on the device, is it local or global memory?
In terms of the context of the problem: I am trying to create neural networks on the GPU, and I want a linked representation (like a linked list, but each neuron stores a linked list of connections that hold weights and pointers to the other neurons). I know I could allocate with cudaMalloc before the kernel launch, but I want the kernel to control how and when the networks are created.
Thanks!
C++ new and delete operate on device heap memory. The device allows for a portion of the global (i.e. on-board) memory to be allocated in this fashion. new and delete work in a similar fashion to device malloc and free.
You can adjust the amount of device global memory available for the heap using a runtime API call.
You may also be interested in the C++ new/delete sample code.
Compute capability 2.0 or greater is required for these features.
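As a minimal sketch of both points, assuming a device of compute capability 2.0 or greater (the heap size and launch configuration below are arbitrary):

#include <cuda_runtime.h>

__global__ void buildNodes()
{
    // Each thread allocates from the device heap; the memory lives in global
    // memory and stays valid across kernel launches until it is deleted.
    int* node = new int(threadIdx.x);
    // ... link 'node' into a device-side data structure here ...
    delete node;
}

int main()
{
    // Enlarge the device heap (default is 8 MB) before any kernel that uses new/delete.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
    buildNodes<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}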
Related
Hello,
I want to create my own dynamic array (vector) class, but I don't know how to allocate memory at an address I point to. In my add function I added a line like:
int* object = new (this->beginning + this->length) int(paramValue);
But Visual Studio shows me the error "operator new cannot be called with the given arguments". How do I make it work? Which arguments should I pass to the new operator?
(I am not sure I understand your question, but...)
You might want to use the placement new operator (though to implement a <vector>-like thing you don't need it). For that you'll need to #include <new>.
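For example, a minimal placement-new sketch (the buffer here is just illustrative, not the asker's beginning/length members):

#include <new>       // required for placement new

int main()
{
    alignas(int) unsigned char buffer[sizeof(int) * 4];   // raw storage we manage ourselves

    // Construct an int at a chosen address inside the buffer; no allocation happens here.
    int* object = new (buffer) int(42);

    // For types with non-trivial destructors you would call the destructor
    // explicitly before reusing the storage.
    return (*object == 42) ? 0 : 1;
}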
But you probably don't need that. Just call plain new from your constructor, and plain delete from your destructor: something like int* arr = new int[length]; in the constructor and delete[] arr; in the destructor.
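As a stripped-down sketch of that approach (no growth or copy control; a real class needs the rule of three/five):

#include <cstddef>

class IntArray {
public:
    explicit IntArray(std::size_t length)
        : length_(length), arr_(new int[length]()) {}   // allocate in the constructor
    ~IntArray() { delete[] arr_; }                      // release in the destructor

    int&        operator[](std::size_t i) { return arr_[i]; }
    std::size_t size() const { return length_; }

private:
    std::size_t length_;
    int*        arr_;

    IntArray(const IntArray&) = delete;                 // copying omitted in this sketch
    IntArray& operator=(const IntArray&) = delete;
};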
(It looks like you are misunderstanding something; I recommend spending several days reading a good C++ programming book.)
how to allocate memory at an address I point to
Insufficient information -- what kind of system? custom hardware? OS?
On a desktop, you could use two steps. First, allocate a block of bytes using something like:
uint8_t* myMemoryBlock = new uint8_t[1000]; // 1000 byte block
Then you might contemplate using placement new at the address "you point to" using 'myMemoryBlock', with a cast.
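A minimal sketch of that two-step approach (the type and sizes are just illustrative):

#include <cstdint>
#include <new>

struct Widget { int id; double value; };

int main()
{
    // Step 1: grab a raw block from the normal dynamic memory system.
    std::uint8_t* myMemoryBlock = new std::uint8_t[1000];   // 1000 byte block

    // Step 2: construct an object at a chosen address inside that block.
    // (A real implementation must also respect alignof(Widget).)
    Widget* w = new (myMemoryBlock) Widget{1, 3.14};

    w->~Widget();             // destroy the object before releasing the block
    delete[] myMemoryBlock;   // release the raw block
    return 0;
}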
On a desktop, the dynamic memory system can be used this way...
But if you are planning to create a user-defined type anyway, just new that type and let the dynamic memory system place it where it may, rather than positioning it within myMemoryBlock.
On a desktop, there is (generally) no memory your user-privilege-level executable can access beyond what 'new' and the rest of the normal memory system provide. All other memory is protected.
mmap on Linux maps devices or files into your executable's memory range. I am unfamiliar with such devices, but I have used mmap with files.
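For instance, a bare-bones file-mapping sketch on Linux (error handling trimmed; the file name is made up):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("data.bin", O_RDONLY);   // "data.bin" is just an example file
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only into this process's virtual address space.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { close(fd); return 1; }

    std::printf("first byte: %d\n", static_cast<const unsigned char*>(p)[0]);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}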
update 2017/03/19
Note 1 - user-privilege level tasks are typically blocked from accessing other / special memory.
Note 2 - memory addresses, such as 'myMemoryBlock' above, are virtual, not physical. This includes code addresses, automatic (stack) addresses, and dynamic memory addresses. If your processor has memory-management hardware, accessing physical addresses (in memory or otherwise) takes special effort in your code.
On a single-board computer (SBC), with or without an OS, I would expect that the address you wish to 'allocate' will not be within the 'dynamic' memory set up by the board support package (BSP).
On this kind of embedded system (an SBC), someone (an architect) has mapped this 'special' memory to an address range not in use for other purposes (i.e. not part of dynamic memory). Here, you simply find out what the address is and use it by casting the uintXX_t value to a pointer of the appropriate type. Something like:
myDataType* p = reinterpret_cast<myDataType*>(premappedAddress);
For more info, you should seek out other sites discussing embedded systems.
I am using a board with integrated gpu and cpu memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a cuda kernel. After doing some digging, it sounds like I want to use some form of a zero copy by calling cudaHostGetDevicePointer. What I am unsure of is the allocation of the memory. Do I have to have created the pointer using cudaHostAlloc? I do not want to have to re-write Blitz++ to do cudaHostAlloc if I don't have to.
My code currently works, but it copies the matrix data every time, which should not be needed on boards with integrated memory.
The pointer has to be created (i.e. allocated) with cudaHostAlloc, even on integrated systems like Jetson.
The reason for this is that the GPU requires (zero-copy) memory to be pinned, i.e. removed from the host demand-paging system. Ordinary allocations are subject to demand-paging, and may not be used as zero-copy i.e. mapped memory for the GPU.
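A minimal sketch of that zero-copy path (the kernel and sizes are just for illustration):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // operates directly on the mapped host memory
}

int main()
{
    const int n = 1024;

    // Mapped (zero-copy) memory may require this flag on older CUDA versions,
    // and it must be set before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate pinned, mapped host memory.
    float* h = nullptr;
    cudaHostAlloc((void**)&h, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = float(i);

    // Get the device-side alias for the same physical memory.
    float* d = nullptr;
    cudaHostGetDevicePointer((void**)&d, h, 0);

    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();   // no cudaMemcpy needed on an integrated system

    std::printf("h[3] = %f\n", h[3]);
    cudaFreeHost(h);
    return 0;
}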
I have a C++ project which uses a C++ DLL built with CUDA.
I'm passing two pointers from the main project to the DLL. Inside the DLL, the arrays are copied to device memory, some calculation is done with them, and then the arrays are copied back to the host.
I heard that the data transfer is better with the copy/compute overlap method, but how can I do it in this case? cudaMemcpyAsync requires pinned memory to be truly asynchronous, and the passed pointers are not pinned, right?
My temporary solution is to use memcpy to copy the passed arrays into pinned arrays, then use streams to overlap the transfers, and afterwards memcpy again from the pinned arrays back into the passed arrays. That extra CPU copying is clearly not a good approach, I think.
And can we do something like passing pinned-memory arrays from the main project to the DLL when both are built with CUDA?
Many thanks in advance.
Memory allocated by the standard C/C++ allocators (malloc and new) can be converted to page-locked memory using the CUDA runtime function cudaHostRegister, and can then be used to overlap asynchronous memory copies between host and device. Be advised: don't forget to unpin the memory that has been pinned this way; use cudaHostUnregister to unpin it. If the memory is not unpinned, undesired results may be produced, e.g. a function may try to pin memory which has already been pinned, or the pinned memory may be freed with free or delete, which is undefined behavior.
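A minimal sketch of that pattern (the function name and buffers are hypothetical; chunked copy/compute overlap on multiple streams is left out for brevity):

#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical DLL entry point: host_in/host_out come from the caller and
// were allocated with plain new/malloc, not cudaHostAlloc.
void processArrays(float* host_in, float* host_out, std::size_t n)
{
    std::size_t bytes = n * sizeof(float);

    // Pin the caller's buffers in place so async copies can overlap.
    cudaHostRegister(host_in,  bytes, cudaHostRegisterDefault);
    cudaHostRegister(host_out, bytes, cudaHostRegisterDefault);

    float* d = nullptr;
    cudaMalloc((void**)&d, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d, host_in, bytes, cudaMemcpyHostToDevice, stream);
    // ... launch kernels on 'stream' here ...
    cudaMemcpyAsync(host_out, d, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d);

    // Unpin before the buffers go back to the caller.
    cudaHostUnregister(host_in);
    cudaHostUnregister(host_out);
}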
I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__ ) method, or via the runtime API (i.e. cudaMalloc, etc.) This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
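A minimal sketch contrasting the three allocation methods and their host access paths (sizes are arbitrary; device-side new requires compute capability 2.0 or greater):

#include <cuda_runtime.h>

__device__ float staticBuf[256];            // 1. statically allocated global memory
__device__ float* deviceHeapBuf = nullptr;  //    holds a device-heap pointer

__global__ void allocOnDevice(int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        deviceHeapBuf = new float[n];       // 2. dynamically allocated (device heap)
}

int main()
{
    const int n = 256;
    float h[n] = {0};

    // 3. Runtime-API allocation, accessed from the host with cudaMemcpy.
    float* runtimeBuf = nullptr;
    cudaMalloc((void**)&runtimeBuf, n * sizeof(float));
    cudaMemcpy(runtimeBuf, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Statically allocated memory is accessed from the host via the symbol APIs.
    cudaMemcpyToSymbol(staticBuf, h, n * sizeof(float));
    cudaMemcpyFromSymbol(h, staticBuf, n * sizeof(float));

    // Device-heap memory is only reachable from device code.
    allocOnDevice<<<1, 1>>>(n);
    cudaDeviceSynchronize();

    cudaFree(runtimeBuf);
    return 0;
}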
First of all, you need to think about coalescing the memory accesses. You didn't mention which GPU you are using. On the latest GPUs, a coalesced memory read will give about the same performance as constant memory, so always make your memory reads and writes as coalesced as you can.
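For example, two toy kernels contrasting coalesced and strided (poorly coalesced) access:

__global__ void coalescedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // adjacent threads touch adjacent addresses
}

__global__ void stridedCopy(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];   // scattered accesses, poor coalescing
}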
Another option is texture memory (if the data size fits into it). Texture memory has its own caching mechanism; it was previously used when global memory reads were non-coalesced, but the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives better performance than dynamically allocated global memory, since the coalescing considerations still apply. Also, declaring global memory at global (variable) scope is not possible for ordinary CUDA global memory; the variables that can be declared globally (in the program) are constant memory variables and textures, which don't need to be passed to kernels as arguments.
For memory optimizations, please see the Memory Optimizations section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations
If I have a special hardware unit with some storage in it that is connected to the computer and memory mapped, so that its storage is accessible in the address range 0x55500000 – 0x555fffff, how can I interface this hardware unit to my C++ program so that dynamic memory is allocated in this hardware unit, not in my computer's memory?
I need to implement a class which has the following function in it.
void * allocMemoryInMyHardware(int numberOfBytesToAllocate);
which returns a pointer to the allocated memory chunk, or null if unable to allocate.
You need to write your own allocator. Search the internet for sample code and tweak it. If you have simple requirements, a basic allocator can be written from scratch in 2-4 hours. This approach will work if your platform does not have virtual memory management and your code can access that range of addresses directly. Otherwise you need to dive into driver development on your platform.
A typical strategy is to add a header to each allocated unit and organize a doubly linked list of free memory areas. NT heaps work in a similar way.
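As a much-simplified sketch of that strategy (a single free flag per header instead of a doubly linked free list, no coalescing, no alignment rounding, no thread safety; the address range is taken from the question, everything else is illustrative):

#include <cstdint>
#include <cstddef>

struct BlockHeader {
    std::size_t size;   // payload size in bytes
    bool        free;   // is this block available?
};

constexpr std::uintptr_t kBase = 0x55500000;   // start of the mapped hardware storage
constexpr std::uintptr_t kEnd  = 0x55600000;   // one past 0x555fffff

static std::uint8_t* heapBegin() { return reinterpret_cast<std::uint8_t*>(kBase); }
static std::uint8_t* heapEnd()   { return reinterpret_cast<std::uint8_t*>(kEnd); }
static bool initialized = false;

static void initHeap()
{
    // One big free block spanning the whole mapped range.
    auto* h = reinterpret_cast<BlockHeader*>(heapBegin());
    h->size = static_cast<std::size_t>(heapEnd() - heapBegin()) - sizeof(BlockHeader);
    h->free = true;
    initialized = true;
}

void* allocMemoryInMyHardware(int numberOfBytesToAllocate)
{
    if (!initialized) initHeap();
    std::size_t need = static_cast<std::size_t>(numberOfBytesToAllocate);

    for (std::uint8_t* p = heapBegin(); p < heapEnd(); ) {
        auto* h = reinterpret_cast<BlockHeader*>(p);
        if (h->free && h->size >= need) {
            // Split the block if there is room left for another header.
            if (h->size > need + sizeof(BlockHeader)) {
                auto* next = reinterpret_cast<BlockHeader*>(p + sizeof(BlockHeader) + need);
                next->size = h->size - need - sizeof(BlockHeader);
                next->free = true;
                h->size = need;
            }
            h->free = false;
            return p + sizeof(BlockHeader);
        }
        p += sizeof(BlockHeader) + h->size;   // advance to the next block
    }
    return nullptr;   // no block large enough
}

void freeMemoryInMyHardware(void* ptr)
{
    if (!ptr) return;
    auto* h = reinterpret_cast<BlockHeader*>(static_cast<std::uint8_t*>(ptr) - sizeof(BlockHeader));
    h->free = true;   // a real allocator would also coalesce with neighbouring free blocks
}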
I think you can use the placement new syntax for this purpose. Using this, you can tell where objects shall be constructed:
#include <new>                   // placement new is declared here
alignas(int) char memory[10];    // raw buffer with suitable alignment for an int
int* i = new (memory) int(42);   // construct the int inside 'memory'