Why does CUDA cuGraphInstantiate() allocate so much memory? - c++

My CUDA code creates 4 read/write GPU buffers, about 100 MB in total.
No other GPU buffers are created anywhere in the code.
It then builds a graph with ~1K nodes, and they all just read from and write to these 4 buffers.
To be clear, the graph itself does not create any buffers either.
Windows Task Manager shows that cuGraphInstantiate() allocates 8 GB of GPU memory.
How does cuGraphInstantiate() work underneath such that it needs to allocate that much?
My cuDriverGetVersion() reports 12000.
The same algorithm I am writing in CUDA already works in DirectX 12 compute, and when creating the command list (which is similar to a CUDA graph) with 1K Dispatch() calls it never allocates 8 GB of memory.
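For reference, here is a minimal runtime-API sketch of roughly this setup. The buffer sizes, kernel body, block size, and the linear 1K-node chain are placeholders of mine, not taken from the real code; cuGraphInstantiate() in the driver API corresponds to cudaGraphInstantiate() here (CUDA 12 signature):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__global__ void touch(float* a, float* b, float* c, float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + c[i] * d[i];     // only reads/writes the 4 shared buffers
}

int main()
{
    const int n = 6 * 1024 * 1024;            // 4 buffers x ~24 MB each, ~100 MB total
    float* buf[4];
    for (auto& p : buf) cudaMalloc(&p, n * sizeof(float));

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    void* args[] = { &buf[0], &buf[1], &buf[2], &buf[3], (void*)&n };
    cudaKernelNodeParams params = {};
    params.func         = (void*)touch;
    params.gridDim      = dim3((n + 255) / 256);
    params.blockDim     = dim3(256);
    params.kernelParams = args;

    // ~1K kernel nodes chained into a linear dependency; none of them allocates memory.
    cudaGraphNode_t prev = nullptr;
    for (int k = 0; k < 1000; ++k) {
        cudaGraphNode_t node;
        cudaGraphAddKernelNode(&node, graph, prev ? &prev : nullptr, prev ? 1 : 0, &params);
        prev = node;
    }

    // The call the question is about (CUDA 12 flags overload).
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);

    cudaGraphLaunch(exec, 0);
    cudaDeviceSynchronize();

    // Observe device memory before/after instantiation from inside the process.
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    printf("free %zu MB / total %zu MB after instantiate+launch\n", freeB >> 20, totalB >> 20);
    return 0;
}
```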

Related

How are CUDA arrays stored in GPU memory? Are they physically linear or not?

According to the CUDA TOOLKIT DOCUMENTATION:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/
Device memory can be allocated either as linear memory or as CUDA arrays.
Does this mean that the CUDA arrays are not stored linearly in GPU memory?
In my experiment, I successfully dumped my data from GPU memory using the cudaMemcpy function. If my data is allocated with cudaMallocArray, does that mean the data is not physically linear in GPU memory and needs to be extracted with another API?
CUDA arrays are indeed stored in GPU device memory ("global" memory), and the bytes are not physically linear in memory. They use an opaque layout optimized for multi-channel, multidimensional texture access and texture filtering. The format is undocumented, since it may change between GPU architectures.
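Because the layout is opaque, plain pointer-based cudaMemcpy cannot address a CUDA array; the array-specific copy functions translate between linear host memory and the opaque device layout. A minimal sketch (the dimensions and channel format are my own placeholders):

```cpp
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const size_t w = 256, h = 256;
    std::vector<float> host(w * h, 1.0f), back(w * h);

    // One float channel per element; the driver picks the internal layout.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, w, h);

    // Upload: source pitch and copy width are given in bytes.
    cudaMemcpy2DToArray(arr, 0, 0, host.data(), w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    // Download back into linear host memory.
    cudaMemcpy2DFromArray(back.data(), w * sizeof(float), arr, 0, 0,
                          w * sizeof(float), h, cudaMemcpyDeviceToHost);

    cudaFreeArray(arr);
    return 0;
}
```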

With OpenCL, How to get GPU memory usage?

I'm looking for a reliable way to determine current GPU memory usage with OpenCL.
I have found the NVIDIA CUDA API cudaMemGetInfo(size_t* free, size_t* total), which returns the free and total memory on the current device.
But I'm looking for a solution for AMD and OpenCL. I could not find similar functionality in OpenCL, and I don't know whether AMD has an equivalent.
I don't want to know how much free memory there is on the OpenCL devices before allocating buffers, but the free memory after allocating them.
A priori, as indicated in "How do I determine available device memory in OpenCL?", with OpenCL there is no way to do this, and supposedly no need to know it.
For AMD, perhaps try CL_DEVICE_GLOBAL_FREE_MEMORY_AMD from the cl_amd_device_attribute_query extension - this extension will probably only work with proprietary drivers, though.
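A hedged sketch of that query: CL_DEVICE_GLOBAL_FREE_MEMORY_AMD comes from CL/cl_ext.h, the extension reports values in kilobytes, and the exact return shape (one or two size_t values) depends on the driver, so this only works on AMD's proprietary stack.

```cpp
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <cstdio>

int main()
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // The extension may return one or two size_t values (in KB):
    // total free memory, and possibly the largest free block.
    size_t freeKB[2] = {0, 0};
    size_t bytesReturned = 0;
    cl_int err = clGetDeviceInfo(device, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD,
                                 sizeof(freeKB), freeKB, &bytesReturned);
    if (err == CL_SUCCESS)
        printf("free device memory: %zu KB\n", freeKB[0]);
    else
        printf("query not supported on this device/driver (error %d)\n", err);
    return 0;
}
```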
In the general case, it's impossible, because AFAIK there's no way to know when buffers are allocated on the device. In this sense OpenCL is higher-level than CUDA: buffers belong to contexts, not devices. Calling clCreateBuffer() can, but doesn't have to, allocate any memory on any device; the implementation automatically migrates buffers to device memory before it executes a kernel that needs them, and moves them away from the device if it needs to free memory for the next kernel. Even if you get a device's free memory, you can't use it to make 100% reliable decisions about whether to run a kernel, because clEnqueueNDRangeKernel() doesn't necessarily launch the kernel immediately (it just enqueues it; if there's something else in the queue, it can be delayed), and some other application on the same computer could get scheduled onto the GPU in the meantime.
If you want to avoid swapping memory, you'll have to make sure that (1) your application is the only one using the GPU, and (2) for each of your kernels, the total size of the buffer arguments is <= CL_DEVICE_GLOBAL_MEM_SIZE.

Directx 11 Memory Management

I've been studying DirectX 11 for a while now, but I'm still confused about how it manages memory. For example, if I create a vertex buffer using ID3D11Device::CreateBuffer, where is the new buffer stored? I know it returns a pointer to the buffer, so does that mean it must be stored in CPU RAM? However, I thought that would make ID3D11DeviceContext::IASetVertexBuffers a very slow operation, because it would have to copy the buffer from CPU RAM to GPU RAM. But if all of the buffers created with ID3D11Device::CreateBuffer were stored in GPU RAM, wouldn't GPU RAM fill up really quickly? Basically I would like to know: when I create a buffer, where is that data stored, in CPU RAM or GPU RAM? And what is ID3D11DeviceContext::IASetVertexBuffers doing with the buffer (copying? setting?).
The general answer is that "it's wherever the driver wants it to be." For DYNAMIC resources, they are typically put into memory that is accessible to both the CPU and the GPU (on modern PCs this is shared across the PCIe bus). STATIC resources can live in video RAM that is only accessible by the GPU, with their data copied in via the 'shared' memory window, or, if space is limited, they are kept in the 'shared' memory window itself. Render targets are usually put in video RAM as well.
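To make the usage distinction concrete, here is a hedged sketch of the two ends of that spectrum; the helper function and vertex data are my own placeholders, and error handling is omitted:

```cpp
#include <d3d11.h>

// An immutable buffer: the CPU never writes it again after creation, so the
// driver is free to keep it in GPU-only video RAM.
ID3D11Buffer* CreateStaticVertexBuffer(ID3D11Device* device,
                                       const void* vertices, UINT byteSize)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = byteSize;
    desc.Usage     = D3D11_USAGE_IMMUTABLE;
    desc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

    D3D11_SUBRESOURCE_DATA init = {};
    init.pSysMem = vertices;                 // initial data copied by the driver

    ID3D11Buffer* buffer = nullptr;
    device->CreateBuffer(&desc, &init, &buffer);
    return buffer;
}

// For data rewritten every frame, D3D11_USAGE_DYNAMIC with
// D3D11_CPU_ACCESS_WRITE hints the driver to keep the buffer CPU-visible.
// In both cases IASetVertexBuffers only binds the existing resource;
// it does not copy the data again.
```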
For a deeper dive on Direct3D video memory management, check out the talk "Why Your Windows Game Won't Run In 2,147,352,576?" which is no longer on MSDN Downloads but can be found on my blog.
If you want the nitty-gritty hardware details, read the driver writer's documentation.
You may also find the Video Memory sample on MSDN Code Gallery educational.

Create OpenGL object that always stays in GPU memory

Is there a method to create an OpenGL object (texture or buffer) that always stays in GPU memory? OpenGL can move objects out to system RAM, but my purpose is to fill GPU memory. For example: I have 1 GB of GPU memory and my app needs to fill 512 MB of it.
Is there a method to create an OpenGL object that always stays in GPU memory?
No.
But my purpose is to fill GPU memory.
In other words, you are trying to deny service to the GPU. That doesn't work. The OS/driver will decide to make space for other things that need to be drawn here and now. Many OSes these days rely on the GPU's 3D acceleration to draw their user interfaces, so the GPU must always stay responsive.
Also, modern GPUs have MMUs and can fetch only subsets of the data of a larger object.

how to design a real time CUDA financial application

I have a general question about how to design my application. I have read the CUDA documentation, but I still don't know what I should look into. I would really appreciate it if someone could shed some light on this.
I want to do some real-time analytics on stocks, say 100 stocks. I have a real-time market data feed that streams updated market prices. What I want to do is:
pre-allocate a memory block for each stock on the CUDA card, and keep that memory resident throughout the trading day;
when new data comes in, directly update the corresponding memory on the CUDA card;
after updating, issue a signal or trigger an event to start the analytical calculation;
when the calculation is done, write the results back to CPU memory.
Here are my questions:
What's the most efficient way to stream data from CPU memory to GPU memory? Because I want this in real time, copying a full memory snapshot from CPU to GPU every second is not acceptable.
I may need to allocate memory blocks for 100 stocks on both the CPU and the GPU. How do I map each CPU memory cell to the corresponding GPU memory cell?
How do I trigger the analytics calculation when new data arrives on the CUDA card?
I am using a Tesla C1060 with CUDA 3.2 on Windows XP.
Thank you very much for any suggestion.
There is nothing unusual in your requirements.
You can keep information in GPU memory as long as your application is running, and do small updates to keep the data in sync with what you have on the CPU. You can allocate your GPU memory with cudaMalloc() and use cudaMemcpy() to write updated data into sections of the allocated memory. Or, you can hold data in a Thrust structure, such as a thrust::device_vector. When you update the device_vector, CUDA memory copies are done in the background.
After you have updated the data, you simply rerun your kernel(s) to get updated results for your calculation.
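A hedged sketch of that pattern: one resident device buffer per stock, updated in place as ticks arrive, then the analytics kernel is re-run. The Tick layout, stock count, and the analytic itself are placeholders of mine.

```cpp
#include <cuda_runtime.h>

struct Tick { float price; float volume; };

__global__ void analyze(const Tick* ticks, float* results, int nStocks)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < nStocks)
        results[s] = ticks[s].price * ticks[s].volume;   // placeholder analytic
}

int main()
{
    const int nStocks = 100;
    Tick*  dTicks;
    float* dResults;
    cudaMalloc(&dTicks,   nStocks * sizeof(Tick));       // lives for the whole session
    cudaMalloc(&dResults, nStocks * sizeof(float));

    float hResults[nStocks];
    Tick  update{101.5f, 2000.0f};
    int   stockIndex = 42;

    // On every incoming tick: overwrite just that stock's slot...
    cudaMemcpy(dTicks + stockIndex, &update, sizeof(Tick), cudaMemcpyHostToDevice);

    // ...then rerun the kernel and read the results back (the D2H copy blocks
    // until the kernel has finished on the default stream).
    analyze<<<(nStocks + 127) / 128, 128>>>(dTicks, dResults, nStocks);
    cudaMemcpy(hResults, dResults, nStocks * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dTicks);
    cudaFree(dResults);
    return 0;
}
```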
Could you expand on question (2)?