I'm developing a voxel octree raycaster in an OpenGL compute shader. It turned out that the algorithm slows down during traversal because I use a small per-invocation array in which I store intermediate data. If I use a shared array for the local work group instead, the calculations are significantly faster, but that is not enough: such an array is only allocated for one work group. Is there a way to optimize this?
Related
I need to access large amounts of data from a GLSL compute shader (read and write).
For reference, I work with an Nvidia A6000 GPU with 48GB of memory; the driver is up to date.
Here is what I've tried so far:
Using an SSBO: glBufferData() can allocate arbitrarily large buffers, but the shader will only be able to access 2GB of memory (according to my tests).
Using a texture: glTextureStorage3D() can allocate very large textures but the shader will only be able to access 4GB of memory (according to my tests).
Using multiple textures: I break the data into multiple bindless 3D textures (GL_ARB_bindless_texture extension) that work like pages of memory (a rough host-side setup is sketched after this question). I store the texture handles in a UBO. It effectively does what I want, but there are several downsides:
The texture/image handle pairs take space in the UBO, which could be needed by something else.
The textures are not accessed in uniform control flow: two invocations of the same subgroup can fetch data from different pages. This is allowed on Nvidia GPUs with the GL_NV_gpu_shader5 extension, but I could not find a similar extension for AMD.
Using NV_shader_buffer_load and NV_shader_buffer_store to get a pointer to GPU memory. I haven't tested this method yet, but I suspect it could be more efficient than solution 3, since there is no need to dereference the texture/image handles, which introduces an indirection.
Thus, I would like to know: Would solution 4 work for my use case? Is there a better/faster way? Is there a cross-platform way?
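For illustration, a rough host-side sketch of the paging setup in option 3. The page size, page count, and UBO binding point are arbitrary placeholders, only the image-handle half of the texture/image handle pairs is shown, and it assumes an existing GL 4.5+ context with GL_ARB_bindless_texture available (glew is just whichever loader you already use):

#include <vector>
#include <GL/glew.h>   // assumption: any loader exposing GL 4.5 + ARB_bindless_texture

// Illustrative page layout: 8 pages of 512^3 R32UI voxels (~0.5 GB each).
constexpr GLsizei kPageSize  = 512;
constexpr int     kPageCount = 8;

void createVoxelPages()
{
    std::vector<GLuint64> pageHandles(kPageCount);
    for (int i = 0; i < kPageCount; ++i) {
        GLuint tex = 0;
        glCreateTextures(GL_TEXTURE_3D, 1, &tex);
        glTextureStorage3D(tex, 1, GL_R32UI, kPageSize, kPageSize, kPageSize);

        // Image handle so the compute shader can both read and write the page.
        pageHandles[i] = glGetImageHandleARB(tex, 0, GL_TRUE, 0, GL_R32UI);
        glMakeImageHandleResidentARB(pageHandles[i], GL_READ_WRITE);
    }

    // Store the 64-bit handles in a UBO; the shader picks a page by indexing this array.
    GLuint handleUBO = 0;
    glCreateBuffers(1, &handleUBO);
    glNamedBufferStorage(handleUBO, kPageCount * sizeof(GLuint64), pageHandles.data(), 0);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, handleUBO);
}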
I am writing my own voxel octree raycaster. In the compute shader I trace a ray to the first voxel and use a local stack array in which I store intermediate data about the ray's traversal. It turned out that this array is the bottleneck of my algorithm: when I switched to a shared stack sized for the whole work group, [x * y][stack_size], I got a big performance boost. Now I want to make one big buffer using a Shader Storage Buffer Object. I realize this is not the best method, but I want to know what performance can be achieved. How can I properly organize the allocation of such a buffer?
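To make the question concrete, here is a rough host-side sketch of the kind of allocation meant here: one SSBO sized so that every invocation gets its own stack slice. The stack size, dispatch dimensions, and SSBO binding point are placeholder assumptions:

#include <GL/glew.h>   // assumption: loader exposing GL 4.5

// Placeholder dispatch parameters; substitute your real work-group setup.
constexpr GLuint kStackSize = 23;                 // e.g. max octree depth (assumption)
constexpr GLuint kGroupsX = 120, kGroupsY = 68;   // glDispatchCompute(kGroupsX, kGroupsY, 1)
constexpr GLuint kLocalX  = 16,  kLocalY  = 16;   // layout(local_size_x/y) in the shader

GLuint createStackSSBO()
{
    const GLuint invocations = kGroupsX * kLocalX * kGroupsY * kLocalY;
    const GLsizeiptr bytes = GLsizeiptr(invocations) * kStackSize * sizeof(GLuint);

    GLuint ssbo = 0;
    glCreateBuffers(1, &ssbo);
    glNamedBufferStorage(ssbo, bytes, nullptr, 0);     // GPU-only scratch, never read back
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, ssbo);
    return ssbo;
}

In the shader, invocation i would then address stack[i * kStackSize + sp], where i could be a flattened index such as gl_GlobalInvocationID.y * totalWidth + gl_GlobalInvocationID.x.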
I am new to OpenGL and I want somebody to explain to me how the program uses the GPU.
I have an array of triangles (a class that contains 3 points). Here is the code that draws them (I know these functions are deprecated).
glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    glVertex3d((GLdouble)trarr[i].p1().x(), (GLdouble)trarr[i].p1().y(), (GLdouble)trarr[i].p1().z());
    glVertex3d((GLdouble)trarr[i].p2().x(), (GLdouble)trarr[i].p2().y(), (GLdouble)trarr[i].p2().z());
    glVertex3d((GLdouble)trarr[i].p3().x(), (GLdouble)trarr[i].p3().y(), (GLdouble)trarr[i].p3().z());
}
glEnd();
And I also use deprecated functions for rotating, transforming, etc.
When the size of the array is bigger than 50k, the program works very slowly.
I tried using only the Intel HD or only the Nvidia GTX 860M (the default Nvidia program allows choosing the GPU), but they both work very slowly. Maybe the Intel HD even works a bit faster.
So, why is there no difference between these two GPUs?
And will the program work faster using shaders?
The probable bottleneck is looping over the vertices, accessing the array and pulling out the vertex data 50,000 times per render, and then sending the data to the GPU for rendering.
Using a VBO would indeed be faster, and it compresses the cost of extracting the data and sending it to the GPU into a one-time cost paid at initialization.
Even using a user memory buffer would speed it up, because you won't be calling 50k functions; the driver can just do a memcpy of the relevant data.
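Reading "user memory buffer" as the legacy client-side vertex arrays (my interpretation), an untested sketch could look like this. It assumes the question's trarr/trsize are accessible, and flattens the points into plain floats once up front:

#include <vector>
#include <GL/gl.h>   // or whatever GL header/loader you already use

// Built once at initialization from the question's triangle array.
std::vector<GLfloat> buildFlatVerts()
{
    std::vector<GLfloat> v;
    v.reserve(trsize * 9);
    for (int i = 0; i < trsize; ++i) {
        const auto& t = trarr[i];
        v.insert(v.end(), {
            GLfloat(t.p1().x()), GLfloat(t.p1().y()), GLfloat(t.p1().z()),
            GLfloat(t.p2().x()), GLfloat(t.p2().y()), GLfloat(t.p2().z()),
            GLfloat(t.p3().x()), GLfloat(t.p3().y()), GLfloat(t.p3().z())
        });
    }
    return v;
}

// Per frame: the driver copies the whole array in one go instead of 150k calls.
void drawFlatVerts(const std::vector<GLfloat>& flatVerts)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, flatVerts.data());
    glDrawArrays(GL_LINES, 0, GLsizei(flatVerts.size() / 3));   // same mode as the original glBegin
    glDisableClientState(GL_VERTEX_ARRAY);
}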
When the size of the array is bigger than 50k, the program works very slowly.
The major bottleneck when drawing in immediate mode is that all your vertices have to be transferred every frame from your program's memory to GPU memory. The bus between GPU and CPU is limited in the amount of data it can transfer, so the best guess is that 50k triangles are simply more than the bus can transport. Another problem is that the driver has to process all the commands you send it on the CPU, which can also be a big overhead.
So, why is there no difference between these two GPUs?
There is (in general) a huge performance difference between the Intel HD graphics and an Nvidia card, but the bus between CPU and GPU might be the same for both.
And will the program work faster using shaders?
It will not benefit directly from the use of shaders, but definitely from storing the vertices once in GPU memory (see VBO/VAO). The second improvement is that you can render the whole VBO using only one draw call, which decreases the number of instructions the CPU has to handle.
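A hedged sketch of what that can look like, reusing a flattened x,y,z float array like the one built in the earlier answer's sketch; attribute location 0 and GL_STATIC_DRAW are assumptions:

#include <vector>
#include <GL/glew.h>   // assumption: loader exposing GL 3.3+

GLuint  gVao = 0, gVbo = 0;
GLsizei gVertexCount = 0;

// Once, at initialization: upload the vertices to GPU memory.
void initVBO(const std::vector<GLfloat>& flatVerts)   // x,y,z per vertex
{
    gVertexCount = GLsizei(flatVerts.size() / 3);
    glGenVertexArrays(1, &gVao);
    glGenBuffers(1, &gVbo);
    glBindVertexArray(gVao);
    glBindBuffer(GL_ARRAY_BUFFER, gVbo);
    glBufferData(GL_ARRAY_BUFFER, flatVerts.size() * sizeof(GLfloat),
                 flatVerts.data(), GL_STATIC_DRAW);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(GLfloat), nullptr);
    glEnableVertexAttribArray(0);
}

// Per frame: a single draw call replaces the 150k glVertex* calls.
void drawVBO()
{
    glBindVertexArray(gVao);
    glDrawArrays(GL_LINES, 0, gVertexCount);   // GL_TRIANGLES instead if you render filled triangles
}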
Seeing the same performance with two GPUs that have substantially different performance potential certainly suggests that your code is CPU limited. But I very much question some of the theories about the performance bottleneck in the other answers/comments.
Some simple calculations suggest that memory bandwidth should not come into play at all. With 50,000 triangles, with 3 vertices each, and 24 bytes per vertex, you're looking at 3,600,000 bytes of vertex data per frame. Say you're targeting 60 frames/second, this is a little over 200 MBytes/second. That's less than 1% of the memory bandwidth of a modern PC.
The most practical implementation of immediate mode on a modern GPU is for drivers to collect all the data into buffers, and then submit it all at once when a buffer fills up. So there's no need for a lot of kernel calls, and the data for each vertex is certainly not sent to the GPU separately.
Driver overhead is most likely the main culprit. With 50,000 triangles, and 3 API calls per triangle, this is 150,000 API calls per frame, or 9 million API calls/second if you're targeting 60 frames/second. That's a lot! For each of these calls, you'll have:
Looping and array accesses in your own code.
The actual function call.
Argument passing.
State management and logic in the driver code.
etc.
One important aspect that makes this much worse than it needs to be: You're using double values for your coordinates. That doubles the amount of data that needs to be passed around compared to using float values. And since the OpenGL vertex pipeline operates in single precision (*), the driver will have to convert all the values to float.
I suspect that you could get a significant performance improvement even with using the deprecated immediate mode calls if you started using float for all your coordinates (both your own storage, and for passing them to OpenGL). You could also use the version of the glVertex*() call that takes a single argument with a pointer to the vector, instead of 3 separate arguments. This would be glVertex3fv() for float vectors.
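Something along those lines, still in immediate mode (untested sketch; the xyz() accessor returning a const GLfloat* is hypothetical, since the question's point class stores doubles):

glBegin(GL_LINES);
for (int i = 0; i < trsize; ++i) {
    // One call with a pointer to 3 floats instead of 3 separate double arguments.
    glVertex3fv(trarr[i].p1().xyz());   // xyz(): hypothetical "const GLfloat*" accessor
    glVertex3fv(trarr[i].p2().xyz());
    glVertex3fv(trarr[i].p3().xyz());
}
glEnd();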
Making the move to VBOs is of course the real solution. It will reduce the number of API calls by orders of magnitude, and avoid any data copying as long as the vertex data does not change over time.
(*) OpenGL 4.1 adds support for double vertex attributes, but they require the use of specific API functions, and only make sense if single precision floats really are not precise enough.
So currently what I am doing is: before loading my elements onto a VBO, I create a new matrix and add them to it. I do that so I can play with the matrix as much as I want.
So what I did was just add the camera position onto the coordinates in the matrix.
Note: the actual position of the objects is saved elsewhere; the matrix is a translation stage.
Now, this works, but I am not sure if it's correct, or whether I should translate to the camera location in the shaders instead of on the CPU.
So this is my question:
Should the camera translation happen on the GPU or the CPU?
I am not entirely sure what you are currently doing. But the sane way of doing this is to not touch the VBO. Instead, pass one or more transformation matrices as uniforms to your vertex shader and perform the matrix multiplication on the GPU.
Changing your VBO data on the CPU is insane: it means either keeping a copy of your vertex data on the CPU, iterating over it, and uploading it, or mapping the buffer and iterating over it. Either way, it would be insanely slow. The whole point of having a VBO is so you can upload your vertex data once and work concurrently on the CPU while the GPU buggers off and does its thing with said vertex data.
Instead, you just store your vertices once in the vertex buffer, preferably in object space (just for sanity's sake). Then you keep track of a transformation matrix for each object which transforms the vertices from the object's space to clip space. You pass that matrix to your vertex shader and do the multiplications for each vertex on the GPU.
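A minimal sketch of that setup (assuming GLM on the host side; the uniform name uMVP and attribute location 0 are illustrative, not anything you have to use):

#include <glm/glm.hpp>
#include <glm/gtc/type_ptr.hpp>
#include <GL/glew.h>

// Vertex shader: the per-vertex matrix multiplication happens here, on the GPU.
const char* kVertexSrc = R"(
    #version 330 core
    layout(location = 0) in vec3 aPos;
    uniform mat4 uMVP;                                   // object space -> clip space
    void main() { gl_Position = uMVP * vec4(aPos, 1.0); }
)";

// Per frame, per object: build the matrix on the CPU, upload it, draw. The VBO is never touched.
void drawObject(GLuint program, GLuint vao, GLsizei vertexCount,
                const glm::mat4& projection, const glm::mat4& view, const glm::mat4& model)
{
    const glm::mat4 mvp = projection * view * model;     // camera translation lives in "view"
    glUseProgram(program);
    glUniformMatrix4fv(glGetUniformLocation(program, "uMVP"), 1, GL_FALSE, glm::value_ptr(mvp));
    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}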
Obviously the GPU is multiplying every vertex by at least one matrix each frame. But the GPU has parallel hardware which does matrix multiplication insanely fast. So especially when your matrices constantly change (e.g. your objects move) this is much faster than doing it on the CPU and updating a massive buffer. Besides, you free your CPU to do other things like physics or audio or whatever.
Now I can imagine you would want to NOT do this if your object never moves; however, GPU matrix multiplication is probably about the same speed as a CPU float multiplication (I don't know the specifics). So it is questionable whether having more shaders for static objects is worth it.
Summary:
Updating buffers on the CPU is SLOW.
Matrix multiplication on the GPU is FAST.
No buffer update? = free up the CPU.
Multiplications on the GPU? = easy and fast to move objects (just change the matrix you upload).
Hope this somewhat helped.
What is the best pattern to get a GPU to efficiently calculate 'anti-functional' routines that usually depend on positioned memory writes instead of reads?
E.g. calculating a histogram, sorting, dividing a number by percentages, merging data of differing sizes into lists, and so on.
The established terms are gather reads and scatter writes.
gather reads
This means that your program will write to a fixed position (like the target fragment position of a fragment shader), but has fast access to arbitrary data sources (textures, uniforms, etc.)
scatter writes
This means that a program receives a stream of input data which it cannot arbitrarily address, but it can do fast writes to arbitrary memory locations.
Clearly the shader architecture of OpenGL is a gather system. The latest OpenGL 4 also allows some scatter writes in the fragment shader, but they're slow.
So what is the most efficient way, these days, to emulate "scattering" with OpenGL? So far it is using a vertex shader operating on pixel-sized points. You send in as many points as you have data points to process and scatter them in target memory by setting their positions accordingly. You can use geometry and tessellation shaders to generate the points processed in the vertex unit. You can use texture buffers and UBOs for data input, using the vertex/point index for addressing.
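To make the point-scattering idea concrete, here is a hedged, shader-only sketch of a histogram: each input value becomes one GL_POINTS vertex positioned over its bin in a uBinCount x 1 render target, and additive blending (glBlendFunc(GL_ONE, GL_ONE)) accumulates the counts. Buffer/FBO setup is omitted and the names are illustrative, not from any existing code:

// Vertex shader: gather the value by gl_VertexID, scatter the point onto its bin.
const char* kScatterVS = R"(
    #version 330 core
    uniform samplerBuffer uValues;   // input data in a texture buffer
    uniform int uBinCount;
    void main() {
        float v   = clamp(texelFetch(uValues, gl_VertexID).r, 0.0, 0.999);
        float bin = floor(v * float(uBinCount));
        // NDC x of the bin's pixel center in a uBinCount x 1 framebuffer.
        float x = (bin + 0.5) / float(uBinCount) * 2.0 - 1.0;
        gl_Position  = vec4(x, 0.0, 0.0, 1.0);
        gl_PointSize = 1.0;          // needs glEnable(GL_PROGRAM_POINT_SIZE)
    }
)";

// Fragment shader: each point adds 1 to its bin via additive blending.
const char* kScatterFS = R"(
    #version 330 core
    out vec4 oCount;
    void main() { oCount = vec4(1.0); }
)";

Drawing would then be a single glDrawArrays(GL_POINTS, 0, valueCount) into a float texture attached to an FBO.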
GPUs are built with multiple memory types. One type is the DDRx RAM that is accessible to both the host CPU and the GPU. In OpenCL and CUDA this is called 'global' memory. For GPUs, data in global memory must be transferred between the GPU and the host. It's usually arranged in banks to allow for pipelined memory access, so random reads/writes to 'global' memory are comparatively slow. The best way to access 'global' memory is sequentially.
Its size ranges from 1 GB to 6 GB per device.
The next type of memory is on the GPU itself. It's shared memory that is available to a number of threads/warps within a compute unit/multiprocessor. This is faster than global memory but not directly accessible from the host. CUDA calls this shared memory.
OpenCL calls this local memory. This is the best memory to use for random access to arrays. For CUDA there is 48 KB of it, and for OpenCL there is 32 KB.
The third kind of memory is the GPU registers, called private memory in OpenCL or local memory in CUDA. Private memory is the fastest, but there is less of it available than local/shared memory.
The best strategy to optimize for random access to memory is to copy data between global and local/shared memory. So a GPU application will copy portions of its global memory to local/shared memory, do the work using local/shared memory, and copy the results back to global memory.
The pattern of copying to local, processing using local, and copying back to global is an important pattern to understand in order to program GPUs well.
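Since this document is OpenGL-centric, here is a hedged GLSL compute-shader sketch of that copy-in / process / copy-out pattern, expressed with GLSL's shared storage (the counterpart of CUDA "shared" / OpenCL "local"); the doubling step is just a stand-in for real work:

const char* kTileCS = R"(
    #version 430
    layout(local_size_x = 256) in;
    layout(std430, binding = 0) buffer Data { float g[]; };

    shared float tile[256];                   // on-chip, per work group

    void main() {
        uint gid = gl_GlobalInvocationID.x;
        uint lid = gl_LocalInvocationID.x;

        tile[lid] = g[gid];                   // 1. copy global -> shared (sequential access)
        barrier();                            //    wait until the whole tile is loaded

        float v = tile[lid] * 2.0;            // 2. work on the fast on-chip copy
        barrier();

        g[gid] = v;                           // 3. copy the result back to global memory
    }
)";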