Accessing large amounts of memory from compute shader? - opengl

I need to access large amounts of data from a GLSL compute shader (read and write).
For reference, I work with an NVIDIA A6000 GPU with 48GB of memory; the driver is up to date.
Here is what I've tried so far:
Using an SSBO: glBufferData() can allocate arbitrarily large buffers, but the shader will only be able to access 2GB of memory (according to my tests).
Using a texture: glTextureStorage3D() can allocate very large textures, but the shader will only be able to access 4GB of memory (according to my tests).
Using multiple textures: I break the data into multiple bindless 3D textures (GL_ARB_bindless_texture extension) that work like pages of memory. I store the texture handles in a UBO. It effectively does what I want, but there are several downsides:
The texture/image handle pairs take space in the UBO, which could be needed by something else.
The textures are not accessed in uniform control flow: two invocations of the same subgroup can fetch data from different pages. This is allowed on NVIDIA GPUs with the GL_NV_gpu_shader5 extension, but I could not find a similar extension for AMD.
Using NV_shader_buffer_load and NV_shader_buffer_store to get a pointer to GPU memory. I haven't tested this method yet, but I suspect it could be more efficient than solution 3, since there is no need to dereference the texture/image handles, which introduces an extra indirection.
Thus, I would like to know: Would solution 4 work for my use case? Is there a better/faster way? Is there a cross-platform way?
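To make solution 3 concrete, here is roughly what the paging compute shader looks like in GLSL; the page size, page count, and block layout below are illustrative assumptions rather than the exact setup described above:

#version 460
#extension GL_ARB_bindless_texture : require
#extension GL_NV_gpu_shader5 : enable      // tolerates non-uniform handle indexing on NVIDIA

layout(local_size_x = 64) in;

// One read handle and one write handle per page, stored in a UBO.
layout(std140, binding = 0) uniform PageTable {
    sampler3D         pageTex[64];   // for reads
    writeonly image3D pageImg[64];   // for writes
};

const uint PAGE_TEXELS = 128u * 128u * 128u;   // assumed 128^3 texels per page

void main()
{
    uint gid   = gl_GlobalInvocationID.x;
    uint page  = gid / PAGE_TEXELS;            // which page holds this element
    uint local = gid % PAGE_TEXELS;            // texel index inside the page
    ivec3 p = ivec3(local % 128u, (local / 128u) % 128u, local / (128u * 128u));

    // 'page' can differ between invocations of the same subgroup, which is the
    // non-uniform access that NV_gpu_shader5 permits on NVIDIA hardware.
    vec4 v = texelFetch(pageTex[page], p, 0);
    imageStore(pageImg[page], p, v * 2.0);
}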

Related

What's the most efficient way to write 1 MB from a compute shader and read it in a fragment shader?

In OpenGL there are so many different kinds of storage options: buffers, textures, framebuffers, etc., and it seems very difficult to figure out which to use when.
I want to write a compute shader that on each frame writes up to around 1 MB to some type of storage that is then read (read-only, random access) by a fragment shader. What is the most efficient mechanism to write that data from the compute shader so it's accessible in the fragment shader?
The data is made up of custom data structures: it is not a specific rendering type of data such as a texture.
Generally there are multiple ways you could handle data transfer across the pipeline.
1. UBO
Use a UBO when you have a fixed amount of data, or when you have enough individual uniforms that it makes sense to group them. UBOs are faster than SSBOs, but they have a size limit and cannot have a variable size.
I have seen some vendors even supporting UBOs of up to 2GB, but I am not sure whether such large UBOs still keep the performance benefit of traditional small UBOs. You can query the UBO size limit and decide for yourself.
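For example, a quick runtime check of both limits might look like this (a rough sketch; glGetInteger64v avoids overflow in case a limit exceeds 2GB):

// assumes a current OpenGL 4.3+ context and <cstdio> for printf
GLint64 maxUbo = 0, maxSsbo = 0;
glGetInteger64v(GL_MAX_UNIFORM_BLOCK_SIZE, &maxUbo);          // spec guarantees at least 16 KB
glGetInteger64v(GL_MAX_SHADER_STORAGE_BLOCK_SIZE, &maxSsbo);  // spec guarantees at least 16 MB
printf("UBO block limit:  %lld bytes\n", (long long)maxUbo);
printf("SSBO block limit: %lld bytes\n", (long long)maxSsbo);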
2. SSBO
When you have really large data, generally prefer an SSBO. SSBOs are slower than UBOs but can hold much larger data; on some hardware they are backed by the same memory path as buffer textures.
As a rule of thumb, I keep up to 64 KB of data in a UBO and put anything larger than that in an SSBO.
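A minimal sketch of that SSBO hand-off for the ~1 MB case (the Item layout, binding point 3, and 32768-element count are assumptions made for the example, not requirements):

// compute shader: fill the buffer each frame
#version 430
layout(local_size_x = 256) in;
struct Item { vec4 color; vec4 params; };                        // hypothetical 32-byte structure
layout(std430, binding = 3) buffer Results { Item items[]; };    // 32768 items = 1 MB
void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i >= uint(items.length())) return;
    items[i].color  = vec4(float(i) / float(items.length()));
    items[i].params = vec4(0.0);
}

// fragment shader: read-only, random access
#version 430
struct Item { vec4 color; vec4 params; };
layout(std430, binding = 3) readonly buffer Results { Item items[]; };
out vec4 fragColor;
void main() {
    uint i = uint(gl_FragCoord.x) % uint(items.length());        // arbitrary random index
    fragColor = items[i].color;
}

// application side, once at startup
GLuint ssbo;
glGenBuffers(1, &ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER, 32768 * 32, nullptr, GL_DYNAMIC_COPY);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, ssbo);

// per frame
glDispatchCompute(32768 / 256, 1, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);   // make the compute writes visible to the draw
// ... then issue the draw call whose fragment shader reads the buffer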
Another possible way to communicate is Transform Feedback, though I am not sure it would be efficient here. However, if the data computed on the GPU has to be processed by the application before being sent to the fragment shader, Transform Feedback could be a good fit.

How well do opengl drivers handle large texture arrays in limited VRAM

My game engine tries to allocate large texture arrays to be able to batch the majority (if not all) of its draws together. This array may become large enough that it fails to allocate, at which point I'd (repeatedly) split the texture array in half.
Is it bad design to push the boundaries until receiving a glGetError:Out of memory and scale back from there?
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory? As in, is it less ideal for the graphics driver to be manipulating a few large texture arrays rather than many individual textures when dealing with other OS operations?
It is hard to evaluate how well drivers handle large texture arrays. Behaviour of different drivers may vary a lot.
While using texture arrays can improve performance by reducing the number of draw calls, that should not be the main goal. Reducing draw calls matters mostly on mobile platforms, and even there several dozen draw calls are not a problem. I'm not sure about your concerns and what exactly you are trying to optimise, but I would recommend using profiling tools from your GPU vendor before doing any optimisation.
Is it bad design to push the boundaries until receiving a glGetError:Out of memory and scale back from there?
This is what is typically done when data is dynamically loaded onto the GPU. Once the error is received, old data should be unloaded to make room for the new data.
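A rough sketch of that allocate-and-back-off strategy (the immutable-storage call and the halving policy are illustrative; note that some drivers defer the physical allocation, so a clean glGetError here is not a hard guarantee):

// Try to allocate a 2D texture array with `layers` layers, halving the layer count on failure.
GLuint AllocateTextureArray(GLsizei width, GLsizei height, GLsizei layers)
{
    while (layers > 0) {
        GLuint tex = 0;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
        while (glGetError() != GL_NO_ERROR) {}       // flush any stale errors first
        glTexStorage3D(GL_TEXTURE_2D_ARRAY, 1, GL_RGBA8, width, height, layers);
        if (glGetError() == GL_NO_ERROR)
            return tex;                              // allocation accepted
        glDeleteTextures(1, &tex);                   // GL_OUT_OF_MEMORY (or other error): halve and retry
        layers /= 2;
    }
    return 0;                                        // nothing could be allocated
}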
Is my application a jerk because it's allocating huge chunks of VRAM, which may require swapping into GTT memory?
There is no way to check whether data was swapped to GTT or not (if the driver supports GTT at all). The driver handles it on its own, and there is no access to that from the OpenGL API. You may need to use profiling tools such as Nsight if you are using an NVIDIA GPU.
However, if you are planning to have one giant texture array, it must fit into VRAM as a whole; it cannot be partially in VRAM and partially in GTT. I would not recommend relying on GTT at all.
It must fit into VRAM because, when you bind it, the driver cannot know beforehand which layers will be used and which won't, since the selection happens in the shader.
Although texture arrays and 3D textures are conceptually different, at the hardware level they work very similarly; the difference is that the former filters in two dimensions and the latter in three.
I was playing with large 3D textures for a while. I experimented with a GeForce 1070 (it has 8GB), and it handles textures of ~1GB very well. The largest texture I managed to load was around 3GB (2048x2048x7**), but it often throws an error. Even when there should be plenty of free VRAM to fit the texture, allocating such a big chunk may still fail for various reasons. So I would not recommend allocating textures that are comparable to the total size of VRAM unless it is absolutely necessary.

OpenGL program with Intel HD and NVidia GPU usage

I am new to OpenGL and I would like somebody to explain to me how the program uses the GPU.
I have an array of triangles (a class that contains 3 points). Here is the code that draws them (I know these functions are deprecated).
glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    glVertex3d((GLdouble)trarr[i].p1().x(), (GLdouble)trarr[i].p1().y(), (GLdouble)trarr[i].p1().z());
    glVertex3d((GLdouble)trarr[i].p2().x(), (GLdouble)trarr[i].p2().y(), (GLdouble)trarr[i].p2().z());
    glVertex3d((GLdouble)trarr[i].p3().x(), (GLdouble)trarr[i].p3().y(), (GLdouble)trarr[i].p3().z());
}
glEnd();
And I also use deprecated functions for rotating, transforming, etc.
When the size of the array is bigger than 50k, the program runs very slowly.
I tried using only the Intel HD or only the NVIDIA GTX 860M (the default NVIDIA program allows choosing the GPU), but they both run very slowly. The Intel HD may even work a bit faster.
So why is there no difference between these two GPUs?
And will the program work faster using shaders?
The probable bottleneck is looping over the vertices: accessing the array and pulling out the vertex data 50,000 times per frame, then sending the data to the GPU for rendering.
Using a VBO would indeed be faster, as it collapses the cost of extracting the data and sending it to the GPU into a one-time cost at initialization.
Even using a client-side (user memory) vertex array would speed things up, because instead of 50k function calls the driver can just do a memcpy of the relevant data.
When the size of the array is bigger than 50k, the program runs very slowly.
The major bottleneck when drawing in immediate mode is that all your vertices have to be transferred every frame from your program's memory to GPU memory. The bus between the GPU and the CPU can only move a limited amount of data, so the best guess is that 50k triangles are simply more than the bus can transport. Another problem is that the driver has to process all the commands you send it on the CPU, which can also be a big overhead.
So why is there no difference between these two GPUs?
There is (in general) a huge performance difference between an Intel HD GPU and an NVIDIA GPU, but the bus that feeds them might be the same.
And will the program work faster using shaders?
It will not benefit directly from the use of shaders, but it definitely will from storing the vertices once in GPU memory (see VBO/VAO). The second improvement is that you can render the whole VBO with a single draw call, which decreases the amount of work the CPU has to do.
Seeing the same performance with two GPUs that have substantially different performance potential certainly suggests that your code is CPU limited. But I very much question some of the theories about the performance bottleneck in the other answers/comments.
Some simple calculations suggest that memory bandwidth should not come into play at all. With 50,000 triangles, with 3 vertices each, and 24 bytes per vertex, you're looking at 3,600,000 bytes of vertex data per frame. Say you're targeting 60 frames/second, this is a little over 200 MBytes/second. That's less than 1% of the memory bandwidth of a modern PC.
The most practical implementation of immediate mode on a modern GPU is for drivers to collect all the data into buffers, and then submit it all at once when a buffer fills up. So there's no need for a lot of kernel calls, and the data for each vertex is certainly not sent to the GPU separately.
Driver overhead is most likely the main culprit. With 50,000 triangles, and 3 API calls per triangle, this is 150,000 API calls per frame, or 9 million API calls/second if you're targeting 60 frames/second. That's a lot! For each of these calls, you'll have:
Looping and array accesses in your own code.
The actual function call.
Argument passing.
State management and logic in the driver code.
etc.
One important aspect that makes this much worse than it needs to be: You're using double values for your coordinates. That doubles the amount of data that needs to be passed around compared to using float values. And since the OpenGL vertex pipeline operates in single precision (*), the driver will have to convert all the values to float.
I suspect that you could get a significant performance improvement even with using the deprecated immediate mode calls if you started using float for all your coordinates (both your own storage, and for passing them to OpenGL). You could also use the version of the glVertex*() call that takes a single argument with a pointer to the vector, instead of 3 separate arguments. This would be glVertex3fv() for float vectors.
Making the move to VBOs is of course the real solution. It will reduce the number of API calls by orders of magnitude, and avoid any data copying as long as the vertex data does not change over time.
(*) OpenGL 4.1 adds support for double vertex attributes, but they require the use of specific API functions, and only make sense if single precision floats really are not precise enough.
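For illustration, a minimal sketch of the VBO path described above (the Vertex struct, attribute layout, and GL_TRIANGLES primitive are assumptions made for the example; error handling and shader setup are omitted):

// One-time setup: copy the triangle vertices into a VBO, using float instead of double.
struct Vertex { float x, y, z; };
std::vector<Vertex> verts;
verts.reserve(trsize * 3);
for (int i = 0; i < trsize; ++i) {
    verts.push_back({ (float)trarr[i].p1().x(), (float)trarr[i].p1().y(), (float)trarr[i].p1().z() });
    verts.push_back({ (float)trarr[i].p2().x(), (float)trarr[i].p2().y(), (float)trarr[i].p2().z() });
    verts.push_back({ (float)trarr[i].p3().x(), (float)trarr[i].p3().y(), (float)trarr[i].p3().z() });
}

GLuint vao, vbo;
glGenVertexArrays(1, &vao);
glGenBuffers(1, &vbo);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(Vertex), verts.data(), GL_STATIC_DRAW);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*)0);

// Every frame: one draw call instead of 150,000 glVertex calls.
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, (GLsizei)verts.size());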

efficient GPU random memory access with OpenGL

What is the best pattern to get a GPU to efficiently calculate 'anti-functional' routines, i.e. ones that usually depend on positioned memory writes instead of reads?
E.g. calculating a histogram, sorting, dividing a number by percentages, merging data of differing sizes into lists, etc.
The established terms are gather reads and scatter writes
gather reads
This means that your program will write to a fixed position (like the target fragment position of a fragment shader), but has fast access to arbitrary data sources (textures, uniforms, etc.)
scatter writes
This means that a program receives a stream of input data which it cannot arbitrarily address, but can do fast writes to arbitrary memory locations.
Clearly the shader architecture of OpenGL is a gather system. The latest OpenGL 4 also allows some scatter writes in the fragment shader, but they're slow.
So what is the most efficient way, these days, to emulate "scattering" with OpenGL? So far the best approach is a vertex shader operating on pixel-sized points. You send in as many points as you have data points to process and scatter them in target memory by setting their positions accordingly. You can use geometry and tessellation shaders to generate the points processed in the vertex unit. You can use texture buffers and UBOs for data input, using the vertex/point index for addressing.
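A rough sketch of that point-scatter idea, here accumulating a histogram into a numBins x 1 float render target with additive blending (the bin count, sampler name, and [0,1) value range are assumptions):

// vertex shader: one point per input value; the value selects the output bin
#version 330
uniform samplerBuffer inputData;    // source values, addressed by gl_VertexID
uniform int numBins;                // e.g. 256
void main() {
    float value = texelFetch(inputData, gl_VertexID).r;               // assumed to lie in [0,1)
    int   bin   = int(value * float(numBins));
    float x     = (float(bin) + 0.5) / float(numBins) * 2.0 - 1.0;    // bin centre in NDC
    gl_Position  = vec4(x, 0.0, 0.0, 1.0);
    gl_PointSize = 1.0;             // requires glEnable(GL_PROGRAM_POINT_SIZE)
}

// fragment shader: each point adds 1 to its bin
// (enable blending with glBlendFunc(GL_ONE, GL_ONE) so the counts accumulate)
#version 330
out vec4 outCount;
void main() { outCount = vec4(1.0); }

Drawing is then a single glDrawArrays(GL_POINTS, 0, numValues) into the histogram framebuffer.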
GPUs are built with multiple memory types. One type is the DDRx RAM that is accessible to both the host CPU and the GPU; in OpenCL and CUDA this is called 'global' memory. For discrete GPUs, data in global memory must be transferred between the GPU and the host. It is usually arranged in banks to allow for pipelined memory access, so random reads/writes to global memory are comparatively slow. The best way to access global memory is sequentially.
Its size ranges from roughly 1GB to 6GB per device.
The next type of memory sits on the GPU itself. It is shared memory that is available to a number of threads/warps within a compute unit/multiprocessor. This is faster than global memory but not directly accessible from the host. CUDA calls this shared memory.
OpenCL calls this local memory. This is the best memory to use for random access to arrays. CUDA typically gives you 48 KB of it and OpenCL 32 KB.
The third kind of memory is the GPU registers, called private memory in OpenCL or local memory in CUDA. Private memory is the fastest, but there is less of it available than local/shared memory.
The best strategy to optimize for random memory access is to copy data between global and local/shared memory: a GPU application will copy portions of its global memory to local/shared memory, do the work using local/shared memory, and copy the results back to global memory.
The pattern of copy to local, process using local, and copy back to global is an important pattern to understand and learn in order to program GPUs well.
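In GLSL compute-shader terms that pattern looks roughly like this; the 'shared' qualifier is GLSL's equivalent of CUDA shared / OpenCL local memory, and the tile reversal is just placeholder work (bounds checks omitted, buffer size assumed to be a multiple of 256):

#version 430
layout(local_size_x = 256) in;

layout(std430, binding = 0) buffer Data { float values[]; };

shared float tile[256];                 // fast on-chip memory shared by the work group

void main() {
    uint gid = gl_GlobalInvocationID.x;
    uint lid = gl_LocalInvocationID.x;

    tile[lid] = values[gid];            // 1. copy global -> shared
    barrier();                          //    wait until the whole tile has been loaded

    float v = tile[255u - lid];         // 2. work entirely out of shared memory

    values[gid] = v;                    // 3. copy the result back to global memory
}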

Storing many small textures in OpenGL

I'm building an OpenGL app with many small textures. I estimate that I will have a few hundred textures on the screen at any given moment.
Can anyone recommend best practices for storing all these textures in memory so as to avoid potential performance issues?
I'm also interested in understanding how OpenGL manages textures. Will OpenGL try to store them into GPU memory? If so, how much GPU memory can I count on? If not, how often does OpenGL pass the textures from application memory to the GPU, and should I be worried about latency when this happens?
I'm working with OpenGL 3.3. I intend to use only modern features, i.e. no immediate mode stuff.
If you have a large number of small textures, you would be best off combining them into a single large texture with each of the small textures occupying known sub-regions (a technique sometimes called a "texture atlas"). Switching which texture is bound can be expensive, in that it will limit how much of your drawing you can batch together. By combining into one you can minimize the number of times you have to rebind. Alternatively, if your textures are very similarly sized, you might look into using an array texture.
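A small sketch of the array-texture alternative, staying within OpenGL 3.3 (the 64x64 tile size, 256 layers, and the pixelDataForLayer() helper are made up for the example):

// Allocate a 2D array texture with one layer per small texture.
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D_ARRAY, tex);
glTexImage3D(GL_TEXTURE_2D_ARRAY, 0, GL_RGBA8, 64, 64, 256, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

// Upload each small texture into its own layer.
for (int layer = 0; layer < 256; ++layer)
    glTexSubImage3D(GL_TEXTURE_2D_ARRAY, 0, 0, 0, layer, 64, 64, 1,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixelDataForLayer(layer));   // hypothetical helper

In the fragment shader the layer index then selects the "texture":
uniform sampler2DArray atlas;
// ...
vec4 c = texture(atlas, vec3(uv, float(layerIndex)));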
OpenGL does try to store your textures in GPU memory insofar as possible, but I do not believe that it is guaranteed to actually reside on the graphics card.
The amount of GPU memory you have available will depend on the hardware you run on and the other demands on the system at the time you run. What exactly "GPU memory" means will also vary across machines: it can be discrete memory used only by the GPU, memory shared with main memory, or some combination of the two.
Assuming your application is not constantly modifying the textures you should not need to be particularly concerned about latency issues. You will provide OpenGL with the textures once and from that point forward it will manage their location in memory. Assuming you don't need more texture data than can easily fit in GPU memory every frame, it shouldn't be cause for concern. If you do need to use a large amount of texture data, try to ensure that you batch all use of a certain texture together to minimize the number of round trips the data has to make. You can also look into the built-in texture compression facilities, supplying something like GL_COMPRESSED_RGBA to your call to glTexImage2D, see the man page for more details.
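For example, requesting a compressed internal format is essentially a one-line change; whether and how the driver actually compresses is implementation-dependent:

// Ask the driver to pick a compressed representation for this texture.
glTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixels);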
Of course, as always, your best bet will be to test these things yourself in a situation close to your expected use case. OpenGL provides a good number of guarantees, but much will vary depending on the particular implementation.