What is the best pattern to get a GPU to efficiently calculate 'anti-functional' routines that usually depend on positioned memory writes instead of reads?
E.g. calculating a histogram, sorting, dividing a number by percentages, merging data of differing sizes into lists, etc.
The established terms are gather reads and scatter writes
gather reads
This means that your program writes to a fixed position (like the target fragment position of a fragment shader), but has fast access to arbitrary data sources (textures, uniforms, etc.).
scatter writes
This means that a program receives a stream of input data which it cannot arbitrarily address, but can do fast writes to arbitrary memory locations.
Clearly the shader architecture of OpenGL is a gather system. The latest OpenGL 4 also allows some scatter writes in the fragment shader, but they're slow.
So what is the most efficient way, these days, to emulate "scattering" with OpenGL? So far the best approach is to use a vertex shader operating on pixel-sized points. You send in as many points as you have data points to process and scatter them in target memory by setting their positions accordingly. You can use geometry and tessellation shaders to generate the points processed by the vertex unit. You can use texture buffers and UBOs for data input, using the vertex/point index for addressing.
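To make the idea concrete, here is a rough sketch of the point-scatter approach. It assumes an OpenGL 4.x context, a shader loader and a bound (possibly empty) VAO; the shader source, buffer contents and names are purely illustrative, not a complete program.

// Vertex shader for the scatter pass, stored as a C string on the host.
const char* scatterVS = R"(
    #version 400
    uniform samplerBuffer elements;  // one texel per data point: xy = target pixel, zw = payload
    uniform vec2 targetSize;         // render target size in pixels
    out vec2 payload;
    void main() {
        vec4 e = texelFetch(elements, gl_VertexID);        // gather the element by point index
        vec2 ndc = (e.xy + 0.5) / targetSize * 2.0 - 1.0;  // pixel coordinate -> clip space
        gl_Position = vec4(ndc, 0.0, 1.0);                 // the "scatter": place the point anywhere
        payload = e.zw;                                    // a pass-through fragment shader writes it out
    }
)";

// Host side: one point per data element, no vertex attributes required.
void scatterPass(GLuint program, GLsizei elementCount)
{
    glUseProgram(program);                      // program = scatterVS + trivial fragment shader
    glDrawArrays(GL_POINTS, 0, elementCount);   // gl_VertexID addresses the input buffer texture
}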
GPUs are built with multiple memory types. One type is the DDRx RAM that is accessible to both the host CPU and the GPU; in OpenCL and CUDA this is called 'global' memory. Data in global memory must be transferred between the GPU and the host. It's usually arranged in banks to allow for pipelined memory access, so random reads/writes to 'global' memory are comparatively slow. The best way to access 'global' memory is sequentially.
Its size ranges from roughly 1 GB to 6 GB per device.
The next type of memory is on the GPU itself. It's shared memory that is available to a number of threads/warps within a compute unit/multiprocessor. This is much faster than global memory but not directly accessible from the host. CUDA calls this shared memory; OpenCL calls it local memory. It is the best memory to use for random access to arrays. For CUDA there is typically 48 KB per multiprocessor, and for OpenCL 32 KB.
The third kind of memory is the GPU register file, called private memory in OpenCL. Private memory is the fastest, but there is less of it available than local/shared memory. (Note that in CUDA, 'local memory' refers to something different: per-thread spill space that actually resides in global memory.)
The best strategy to optimize for random access to memory is to copy data between global and local/shared memory. So a GPU application will copy portions of its global memory to local/shared memory, do the work in local/shared memory, and copy the results back to global memory.
The pattern of copying to local, processing in local, and copying back to global is an important one to understand in order to program GPUs well.
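A minimal CUDA sketch of that pattern (the kernel just reverses each block's elements; the kernel name and the 256-thread block size are illustrative):

#include <cuda_runtime.h>

// Launch with 256 threads per block, e.g. reverseBlocks<<<(n + 255) / 256, 256>>>(in, out, n);
__global__ void reverseBlocks(const float* in, float* out, int n)
{
    __shared__ float tile[256];                  // on-chip shared/local memory for this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];               // coalesced copy: global -> shared
    __syncthreads();                             // make the whole tile visible to the block

    // Random access within the tile is cheap; here we simply read it in reverse order.
    int j = blockDim.x - 1 - threadIdx.x;
    int src = blockIdx.x * blockDim.x + j;
    if (i < n && src < n)
        out[i] = tile[j];                        // coalesced copy: shared -> global
}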
I need to access large amounts of data from a GLSL compute shader (read and write).
For reference, I work with an NVIDIA A6000 GPU with 48 GB of memory; the driver is up to date.
Here is what I've tried so far:
Using an SSBO: glBufferData() can allocate arbitrarily large buffers, but the shader will only be able to access 2 GB of memory (according to my tests).
Using a texture: glTextureStorage3D() can allocate very large textures, but the shader will only be able to access 4 GB of memory (according to my tests).
Using multiple textures: I break the data into multiple bindless 3D textures (GL_ARB_bindless_texture extension) that work like pages of memory. I store the texture handles in a UBO. It effectively does what I want, but there are several downsides:
The texture/image handle pairs take space in the UBO, which could be needed by something else.
The textures are not accessed in a uniform control flow: two invocations of the same subgroup can fetch data in different pages. This is allowed on Nvidia gpus with the GL_NV_gpu_shader5 extension, but I could not find a similar extension on AMD.
Using NV_shader_buffer_load and NV_shader_buffer_store to get a pointer to GPU memory. I haven't tested this method yet, but I suspect it could be more efficient than solution 3, since there is no need to dereference the texture/image handles, which introduces an indirection.
Thus, I would like to know: Would solution 4 work for my use case? Is there a better/faster way? Is there a cross-platform way?
TL;DR: Do I need to mirror read-only lookup textures and input buffers across multiple devices when doing Multi-GPU programming with CUDA (whether it is a strict requirement or for best performance)?
I have a GPU kernel which takes in two textures for lookups and two (smallish) buffers for input data.
I've expanded my code to allow for multiple GPUs (our system will have a max of 8, but for testing I'm on a smaller dev system using only 2). Our system uses NVLINK and we have UVA enabled.
My setup involves making device 0 a sort of "master" or "root" device where the final result is stored and the final serial (serial as in only executable on one GPU) operations occur. All devices are set up to allow peer access to dev 0. The kernel is invoked multiple times on each device in a loop of the form:
for (unsigned int f = 0; f < maxIterations; f++)
{
    unsigned int devNum = f % maxDevices; // maxIterations >> maxDevices
    cudaSetDevice(devNum);
    cudaDeviceSynchronize(); // Is this really needed?
    executeKernel<<<>>>(workBuffers[devNum], luTex1, luTex2, inputBufferA, inputBufferB);
    cudaMemcpyAsync(&bigGiantBufferOnDev0[f * bufferStride],
                    workBuffers[devNum],
                    sizeof(float) * bufferStride,
                    cudaMemcpyDeviceToDevice);
}
As one can see, each device has its own "work buffer" for writing out intermediate results, and these results are then memcpy'd to device 0.
The work (output) buffers are several orders of magnitude larger than the input buffers, and I noticed when I'd made a mistake and accessed buffers across devices that there was a major performance hit (presumably because kernels were accessing memory on another device). I haven't, however, noticed a similar hit with the read-only input buffers after fixing the output buffer issue.
Which brings me to my question: Do I actually need to mirror these input buffers and textures across devices, or is there a caching mechanism that makes this unnecessary? Why do I notice such a massive performance hit when accessing the work buffers across devices, but seemingly incur no such penalty with the input buffers/textures?
Texturing, as well as ordinary global data access, is possible "remotely", if you have enabled peer access. Since such access occurs over NVLink (or the peer-capable fabric), it will generally be slower.
For "smallish" input buffers, it may be that the GPU caching mechanisms tend to reduce or mitigate the penalty associated with remote access. The GPU has specific read-only caches that are designed to help with read-only/input data, and of course the texturing mechanism has its own cache. Detailed performance statements are not possible unless actual analysis is done with actual code.
If you use a Pascal-or-later GPU, it supports unified memory and you don't need explicit data migration.
When code running on a CPU or GPU accesses data allocated this way (often called CUDA managed data), the CUDA system software and/or the hardware takes care of migrating memory pages to the memory of the accessing processor.
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
If you use the old-school way of allocating buffers (cudaMalloc), you do need to mirror the data, I think.
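A minimal managed-memory sketch (Pascal or newer; the kernel and sizes are illustrative). Every device dereferences the same pointer, and the driver migrates or maps pages on demand instead of requiring explicit mirroring:

#include <cuda_runtime.h>

__global__ void addOne(float* p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));      // one allocation, visible to the CPU and all GPUs

    for (int i = 0; i < n; ++i) data[i] = 0.0f;       // CPU touches the pages first

    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    for (int d = 0; d < devCount; ++d) {
        cudaSetDevice(d);
        addOne<<<(n + 255) / 256, 256>>>(data, n);    // same managed pointer on every device
        cudaDeviceSynchronize();                      // pages migrate to whichever processor touches them
    }

    cudaFree(data);
    return 0;
}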
Not sure what the DX parlance is for these, but I'm sure they have a similar notion.
As far as I'm aware, the advantage of VBOs is that they allocate memory that's directly available to the GPU. We can then upload data to this buffer and keep it there for an extended number of frames, avoiding the overhead of uploading the data every frame. Additionally, we're able to alter this data on a per-datum basis, if we choose to.
Therefore, I can see the advantage of using VBOs for static geometry, but I don't see any benefit at all for dynamic objects, since you pretty much have to update all the data every frame anyway?
There are several methods of updating buffers in OpenGL. If you have dynamic data, you can simply reinitialize the buffer storage every frame with the new data (e.g. with glBufferData). You can also use client vertex buffer pointers, in compatibility contexts. However, these methods can cause 'churn' in the memory allocation of the driver. The new data storage essentially has to sit in system memory until the GPU driver handles it, and it's not possible to get feedback on this process.
In later OpenGL versions (4.4, and via extensions in earlier versions), functionality was introduced to reduce the overhead of updating dynamic buffers, allowing GPU-allocated memory to be written without direct driver synchronization. This essentially requires that you have the glBufferStorage and glMapBufferRange functionality available. You create the buffer storage with glBufferStorage, passing GL_MAP_PERSISTENT_BIT (together with GL_MAP_WRITE_BIT and/or GL_MAP_READ_BIT, depending on whether you are writing and/or reading the data), and then map it persistently with glMapBufferRange using the same flags. However, this technique also requires that you use GPU fencing to ensure you are not overwriting the data while the GPU is still reading it. Using this method makes updating VBOs much more efficient than reinitializing the data store or using client buffers.
There is a good presentation on GDC Vault about this technique (skip to the DynamicStreaming heading).
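A rough sketch of the persistent-mapping approach, triple-buffered with fences. It assumes an OpenGL 4.4 context, a function loader and the usual headers; REGION_SIZE, newVertexData and the draw call are illustrative placeholders:

// One-time setup: immutable storage, mapped once and kept mapped.
GLuint  vbo;
GLsync  fences[3] = {};
void*   mappedPtr = nullptr;
const GLsizeiptr REGION_SIZE = 1 << 20;   // hypothetical per-frame region size in bytes

void createStreamingVBO()
{
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    GLbitfield mapFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glBufferStorage(GL_ARRAY_BUFFER, 3 * REGION_SIZE, nullptr, mapFlags);
    mappedPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, 3 * REGION_SIZE, mapFlags);
}

// Per frame: wait on the fence guarding the region we are about to overwrite,
// write the new data, draw from it, then fence it again.
void streamFrame(unsigned frameIndex, const void* newVertexData, GLsizei vertexCount)
{
    int region = frameIndex % 3;
    if (fences[region]) {
        const GLuint64 timeoutNs = 1000000000;   // 1 second
        glClientWaitSync(fences[region], GL_SYNC_FLUSH_COMMANDS_BIT, timeoutNs);
        glDeleteSync(fences[region]);
    }
    memcpy((char*)mappedPtr + region * REGION_SIZE, newVertexData, REGION_SIZE);
    // ... set up vertex attributes with a byte offset of region * REGION_SIZE, then draw:
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
    fences[region] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}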
AFAIK, by creating a dynamic vertex buffer, you are giving the graphics adapter driver a hint to place the vertex buffer in memory that is fast for the CPU to write but also reasonably fast for the GPU to read. The driver will usually manage it to minimize GPU stalls by handing out non-overlapping memory areas, so that the CPU can write to one area while the GPU reads from another.
If you do not give that hint, it is assumed to be a static resource, so it will be placed in memory that is fast for the GPU to read/write but very slow for the CPU to write.
On a compute capability 2.x device, how would I make sure that the GPU uses coalesced memory accesses when using mapped pinned memory, assuming that the 2D data would normally require padding when using global memory?
I can't seem to find information about this anywhere, perhaps I should be looking better or perhaps I am missing something. Any pointers in the right direction are welcome...
The coalescing approach should be applied when using zero-copy memory. Quoting the CUDA C Best Practices Guide:
Because the data is not cached on the GPU, mapped pinned memory should be read or written only once, and the global loads and stores that read and write the memory should be coalesced.
Quoting the "CUDA Programming" book, by S. Cook
If you think about what happens with access to global memory, an entire cache line is brought in from memory on compute 2.x hardware. Even on compute 1.x hardware the same 128 bytes, potentially reduced to 64 or 32, is fetched from global memory.
NVIDIA does not publish the size of the PCI-E transfers it uses, or details on how zero copy is actually implemented. However, the coalescing approach used for global memory could be used with PCI-E transfer. The warp memory latency hiding model can equally be applied to PCI-E transfers, providing there is enough arithmetic density to hide the latency of the PCI-E transfers.
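Putting that together, a minimal zero-copy sketch in CUDA (names are illustrative; it assumes the device supports mapped host memory). Consecutive threads touch consecutive elements exactly once, so the loads and stores are coalesced:

#include <cuda_runtime.h>
#include <cstdio>

// Coalesced access: thread i reads in[i] and writes out[i], each exactly once.
__global__ void scaleZeroCopy(const float* in, float* out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * s;   // one coalesced load, one coalesced store per element
}

int main()
{
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);            // must precede context creation

    float *hIn, *hOut, *dIn, *dOut;
    cudaHostAlloc(&hIn,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&hOut, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hIn[i] = float(i);

    cudaHostGetDevicePointer(&dIn,  hIn,  0);         // device-visible aliases of pinned host memory
    cudaHostGetDevicePointer(&dOut, hOut, 0);

    scaleZeroCopy<<<(n + 255) / 256, 256>>>(dIn, dOut, n, 2.0f);
    cudaDeviceSynchronize();                          // kernel accesses travel over PCI-E

    printf("%f\n", hOut[42]);                         // expect 84.0
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return 0;
}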
In my OpenGL program I read from the header file to find out the geometry size, then malloc an index array and a vertex array which I then pass to the VBO. Is it even possible to read directly from the hard drive, or is the GPU's memory linked to the computer's RAM only?
The GPU is not directly connected to the system RAM. There's a bus in between, in current-generation computers a PCI-Express bus. ATA storage has a controller in between as well. There is no direct connection between memories.
BUT there is DMA, which allows certain peripherals to directly access system memory through DMA channels. DMA on PCI-Express also works between peripherals. Theoretically a GPU could do DMA to the ATA controller.
Practically this is of little use! Why? Because of filesystems. Even if there were some kind of driver support to let the GPU access a storage peripheral directly, it'd still have to do all the filesystem work, which doesn't parallelize to the degree that GPUs are designed for.
Now, regarding your question:
is it even possible to read directly from the hard drive, or is the GPU's memory linked to the computer's RAM only?
Why not simply memory-map those files? That way you avoid allocating a buffer you first read into, and passing a memory-mapped file pointer to OpenGL allows the driver to perform a DMA transfer between the storage driver's buffers and the GPU, which is as close as it gets to your original request. Of course the data on the storage device must be prepared in a format that's suitable for the GPU, otherwise it's of little use. If it requires some preprocessing, the best thing to use is the CPU; GPUs don't like to serialize data.
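A minimal POSIX sketch of that idea (it assumes a current OpenGL context and function loader; the function name and the assumption that the file already contains GPU-ready vertex data are illustrative):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Map a file of ready-to-use vertex data and hand the mapping straight to OpenGL.
GLuint uploadMappedFile(const char* path)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // The file contents appear in our address space without an explicit read() into a malloc'd buffer.
    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    // The driver is free to stream from the mapping (i.e. the page cache) into GPU memory via DMA.
    glBufferData(GL_ARRAY_BUFFER, st.st_size, data, GL_STATIC_DRAW);

    munmap(data, st.st_size);
    close(fd);
    return vbo;
}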