I am writing my own voxel octree raycaster. In a compute shader I trace each ray to the first voxel it hits, using a local stack array to store intermediate data about the ray's traversal. This stack turned out to be the bottleneck of my algorithm: when I replaced it with a shared array sized for the whole work group, [x * y][stack_size], I got a big performance boost. Now I want to make one big buffer using a Shader Storage Buffer Object. I realize that this is not the best method, but I want to know what performance can be achieved. How should I organize the allocation of such a buffer?
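To make the question concrete, here is a minimal host-side sketch of how such a buffer could be sized and indexed. It assumes a 2D dispatch with one ray per invocation and a fixed number of stack entries per ray; the names StackLayout, makeStackLayout, slotContiguous and slotInterleaved are hypothetical, not part of any API. The two index functions show two possible layouts: one keeps each ray's stack contiguous, the other places entries of the same depth for neighbouring rays next to each other. Which one is faster has to be measured on the target hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of per-ray stacks stored in one big buffer.
struct StackLayout {
    std::uint32_t invocations;  // total number of rays (width * height)
    std::uint32_t stackSize;    // stack entries reserved per ray
    std::size_t   bytes;        // total size to request for the buffer
};

// Computes how many bytes the buffer needs for a width x height dispatch.
StackLayout makeStackLayout(std::uint32_t width, std::uint32_t height,
                            std::uint32_t stackSize, std::size_t entryBytes) {
    assert(width > 0 && height > 0 && stackSize > 0 && entryBytes > 0);
    StackLayout layout{};
    layout.invocations = width * height;
    layout.stackSize = stackSize;
    layout.bytes = static_cast<std::size_t>(layout.invocations) * stackSize * entryBytes;
    return layout;
}

// Layout A: each ray owns a contiguous run of stackSize entries.
std::size_t slotContiguous(const StackLayout& layout,
                           std::uint32_t invocation, std::uint32_t depth) {
    assert(invocation < layout.invocations && depth < layout.stackSize);
    return static_cast<std::size_t>(invocation) * layout.stackSize + depth;
}

// Layout B: entries of equal depth for neighbouring rays are adjacent.
std::size_t slotInterleaved(const StackLayout& layout,
                            std::uint32_t invocation, std::uint32_t depth) {
    assert(invocation < layout.invocations && depth < layout.stackSize);
    return static_cast<std::size_t>(depth) * layout.invocations + invocation;
}
```

The resulting byte count is what would be passed to glBufferData for a GL_SHADER_STORAGE_BUFFER target, and the same index arithmetic would be repeated inside the shader. The total size also has to stay within the implementation's limit (GL_MAX_SHADER_STORAGE_BLOCK_SIZE).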
I am currently building an application in Vulkan where I will be sampling a lot of data from a buffer. I will be using as much storage as possible, but sampling speed is also important. My data is in the form of a 2D array of 32-bit integers. I can either upload it as a texture and use a texture sampler for it, or as a storage buffer. I read that storage buffers are generally slow, so I was considering using the image sampler to read my data in a fragment shader. I would have to disable mipmapping and filtering, and convert UV coordinates to array indices, but if it's faster I think it might be worth it.
My question is, would it generally be worth it to store my data in an image sampler, or should I do the obvious and use a storage buffer? What are the pros/cons of each approach?
Guarantees about performance do not exist.
But the Vulkan API tries not to deceive you. The obvious way is likely the right way.
If you want to sample, then sample. If you want to do raw access, then do raw access. Generally, you should not be trying to force a square peg into a round hole.
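For reference, the UV-to-index conversion mentioned in the question is small either way. Below is a minimal sketch, assuming normalized coordinates, row-major storage and clamp-to-edge behaviour; texelCoord and uvToIndex are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Maps one normalized coordinate to an integer texel coordinate,
// clamping to the valid range [0, size - 1] (clamp-to-edge behaviour).
std::uint32_t texelCoord(float t, std::uint32_t size) {
    assert(size > 0);
    const std::int64_t raw = static_cast<std::int64_t>(std::floor(t * static_cast<float>(size)));
    const std::int64_t last = static_cast<std::int64_t>(size) - 1;
    return static_cast<std::uint32_t>(std::min(std::max<std::int64_t>(raw, 0), last));
}

// Converts a UV pair to a flat index into a row-major width x height array.
std::size_t uvToIndex(float u, float v, std::uint32_t width, std::uint32_t height) {
    const std::uint32_t x = texelCoord(u, width);
    const std::uint32_t y = texelCoord(v, height);
    return static_cast<std::size_t>(y) * width + x;
}
```

With a storage buffer this arithmetic would live in the shader. With an image, integer texel fetches skip filtering, so the mipmapping and filtering settings stop mattering for this kind of access.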
I'm developing a voxel octree raycaster in a compute shader using OpenGL. It turned out that my algorithm is slowed down by a small per-invocation array in which I store intermediate data. If I use a shared array for the local work group instead, the calculations are significantly faster, but this is still not enough, and such an array is allocated for only one group. Is there a way to optimize this?
So let's say I have 100 different meshes that all use the same OpenGL shader. From reading OpenGL best practices, it seems I should place them into the same vertex buffer object and draw them using glDrawElementsBaseVertex. Now my question is, if I only render a fraction of these meshes every frame, am I wasting resources by having all these meshes in the same vertex buffer object? What are the best practices for batching in this context?
Also are there any guidelines or ways I can determine how much should be placed into a single vertex buffer object?
if I only render a fraction of these meshes every frame, am I wasting resources by having all these meshes in the same vertex buffer object?
What resources could you possibly be wasting? The mere act of rendering doesn't use resources. And since you're going to render those other meshes sooner or later, it's better to have them in memory than to have to DMA them up.
Of course, this has to be balanced against the question of how much stuff you can fit into memory. It's a memory vs. performance tradeoff, and you have to decide for yourself and your application how appropriate it is to keep data you're not actively using around.
Common techniques for dealing with this include streaming. That is, what data is in memory depends on where you are in the scene. As you move through the world, new data for new areas is loaded in, overwriting data for old areas.
Also are there any guidelines or ways I can determine how much should be placed into a single vertex buffer object?
As much as you possibly can. The general rule of thumb is that the number of buffer objects you have should not vary with the number of objects you render.
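As an illustration of how several meshes can share one vertex buffer and one index buffer, here is a minimal sketch of the bookkeeping. It assumes 32-bit indices and that each mesh's indices are relative to its own first vertex; MeshSize, MeshRange and packMeshes are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Size of one mesh before it is packed into the shared buffers.
struct MeshSize {
    std::uint32_t vertexCount;
    std::uint32_t indexCount;
};

// Where one mesh ended up inside the shared vertex and index buffers.
struct MeshRange {
    std::int32_t  baseVertex;       // added to every index when drawing
    std::size_t   indexByteOffset;  // byte offset into the index buffer
    std::uint32_t indexCount;       // number of indices to draw
};

// Lays the meshes out back to back and records each mesh's range.
std::vector<MeshRange> packMeshes(const std::vector<MeshSize>& meshes) {
    std::vector<MeshRange> ranges;
    ranges.reserve(meshes.size());
    std::uint32_t vertexCursor = 0;
    std::size_t indexCursor = 0;
    for (const MeshSize& mesh : meshes) {
        MeshRange range{};
        range.baseVertex = static_cast<std::int32_t>(vertexCursor);
        range.indexByteOffset = indexCursor * sizeof(std::uint32_t);
        range.indexCount = mesh.indexCount;
        ranges.push_back(range);
        vertexCursor += mesh.vertexCount;
        indexCursor += mesh.indexCount;
    }
    return ranges;
}
```

Drawing mesh i would then be a call along the lines of glDrawElementsBaseVertex(GL_TRIANGLES, range.indexCount, GL_UNSIGNED_INT, (void*)range.indexByteOffset, range.baseVertex), issued only for the meshes that are visible in that frame.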
Let's say I have 5 entities (objects) with a method Render(). Every entity needs to set its own vertices in a buffer for rendering.
Which of the following two options is better?
Use one big pre-allocated buffer created with glGenBuffers, which every entity will use (the id of the buffer is passed as an argument to the Render methods) by writing its vertices to the buffer with glBufferSubData.
Every entity creates and uses its own buffer.
If one big buffer is better, how can I render all vertices in this buffer (from all entities) properly, with proper shaders and everything?
Having multiple VBOs is fine as long as they have a certain size. What you want to avoid is to have a lot of small draw calls, and to have to bind different buffers very frequently.
How large the buffers have to be to avoid excessive overhead depends on so many factors that it's almost impossible to even give a rule of thumb. Factors that come into play include:
Hardware performance characteristics.
Driver efficiency.
Number of vertices relative to number of fragments (triangle size).
Complexity of shaders.
Generally it can make sense to keep similar/related objects that you typically draw at the same time in a single vertex buffer.
Putting everything in a single buffer seems extreme, and could in fact have adverse effects. Say you have a large "world", where you only render a small subset in any given frame. If you go to the extreme, and have all vertices in one giant buffer, that buffer needs to be accessible to the GPU for each draw call. Depending on the architecture, and how the buffer is allocated, this could mean:
Attempting to keep the buffer in dedicated GPU memory (e.g. VRAM), which could be problematic if it's too large.
Mapping the memory into the GPU address space.
Pinning/wiring the memory.
If any of the above needs to be applied to a very large buffer, but you end up using only a small fraction of it to render a frame, there is significant waste in these operations. In a system with VRAM, it could also prevent other allocations, like textures, from fitting in VRAM.
If rendering is done with calls that can only access a subset of the buffer given by the arguments, like glDrawArrays() or glDrawRangeElements(), it might be possible for the driver to avoid making the whole buffer GPU accessible. But I wouldn't necessarily count on that happening.
It's easier to use one VBO (Vertex Buffer Object), created with glGenBuffers, for each entity you have. It's not always the best thing to do, and it depends on the use case, but in most cases having one VBO per entity is not a problem and rendering is rarely affected.
Good info is located at: OpenGL Vertex Specification Best Practices
So currently, before loading my elements into a VBO, I create a new matrix and add them to it. I do that so I can manipulate the matrix as much as I want.
So what I did was simply add the camera position to the coordinates in the matrix.
Note: the actual position of the objects is saved elsewhere; the matrix is only a translation stage.
Now, this works, but I am not sure if it's correct or if I should translate to the camera location in the shaders instead of on the CPU.
So this is my question:
Should the camera translation happen on the GPU or the CPU?
I am not entirely sure what you are currently doing. But the sane way of doing this is to not touch the VBO. Instead, pass one or more transformation matrices as uniforms to your vertex shader and perform the matrix multiplication on the GPU.
Changing your VBO data on the CPU is insane. It means either keeping a copy of your vertex data on the CPU, iterating over it and re-uploading it, or mapping the buffer and iterating over it in place. Either way, it would be insanely slow. The whole point of having a VBO is so you can upload your vertex data once and work concurrently on the CPU while the GPU buggers off and does its thing with said vertex data.
Instead, you just store your vertices once in the vertex buffer, preferably in object space (just for sanity's sake). Then you keep track of a transformation matrix for each object which transforms the vertices from the object's space to clip space. You pass that matrix to your vertex shader and do the multiplications for each vertex on the GPU.
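To show what that per-vertex multiplication amounts to, here is a minimal CPU-side sketch of the arithmetic a vertex shader would do. It assumes column-major 4x4 matrices and a camera that only translates; Vec4, Mat4, translation and transform are hypothetical helpers, not a library API.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<float, 16>;  // column-major: element (row, col) is at [col * 4 + row]

// Builds a matrix that translates by (x, y, z).
Mat4 translation(float x, float y, float z) {
    return Mat4{1.0f, 0.0f, 0.0f, 0.0f,
                0.0f, 1.0f, 0.0f, 0.0f,
                0.0f, 0.0f, 1.0f, 0.0f,
                x,    y,    z,    1.0f};
}

// Multiplies a matrix by a column vector, as the vertex shader would per vertex.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 result{0.0f, 0.0f, 0.0f, 0.0f};
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            result[row] += m[col * 4 + row] * v[col];
        }
    }
    return result;
}
```

A camera at (2, 0, 5) would correspond to translation(-2, 0, -5): moving the camera changes only these 16 floats, which are uploaded as a uniform, while the vertex data in the VBO stays untouched.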
Obviously the GPU is multiplying every vertex by at least one matrix each frame. But the GPU has parallel hardware which does matrix multiplication insanely fast. So especially when your matrices constantly change (e.g. your objects move) this is much faster than doing it on the CPU and updating a massive buffer. Besides, you free your CPU to do other things like physics or audio or whatever.
Now I can imagine you would want to NOT do this if your object never moves, however, GPU matrix multiplication is probably about the same speed as a CPU float multiplication (I don't know specifics). So it is questionable if having more shaders for static objects is worth it.
Summary:
Updating buffers on the CPU is SLOW.
Matrix multiplication on the GPU is FAST.
No buffer update? = free up the CPU.
Multiplications on the GPU? = easy and fast to move objects (just change the matrix you upload).
Hope this somewhat helped.