Occlusion Queries and Instanced Rendering

I'm facing a problem where the use of an occlusion query in combination with instanced rendering would be desirable.
As far as I understood, something like
glBeginQuery(GL_ANY_SAMPLES_PASSED, occlusionQuery);
glDrawArraysInstanced(mode, i, j, countInstances);
glEndQuery(GL_ANY_SAMPLES_PASSED);
will only tell me if any of the instances were drawn.
What I would need to know is which set of instances has been drawn (giving me the IDs of all visible instances). Drawing each instance in its own call is not an option for me.
An alternative would be to color-code the instances and detect the visible instances manually.
But is there really no way to solve this problem with a query command and why would it not be possible?

It's not possible for several reasons.
Query objects only contain a single counter value. What you want would require a separate sample passed count for each instance.
Even if query objects stored arrays of sample counts, you can issue more than one draw call in the begin/end scope of a query. So how would OpenGL know which part of which draw call belonged to which query value in the array? You can even change other state within the query scope; uniform bindings, programs, pretty much anything.
The samples-passed count is determined entirely by the rasterizer hardware on the GPU. And the rasterizer neither knows nor cares which instance generated a triangle.
Instancing is a function of the vertex processing and/or vertex specification stages; by the time the rasterizer sees it, that information is gone. Notice that fragment shaders don't even get an instance ID as input, unless you explicitly create one by passing it from your vertex processing stage(s).
However, if you truly want to do this you could use image load/store and its atomic operations. That is, pass the fragment shader the instance in question (as an int data type, with flat interpolation). This FS also uses a uimageBuffer buffer texture, which uses the GL_R32UI format (or you can use an SSBO unbounded array). It then performs an imageAtomicAdd, using the instance value passed in as the index to the buffer. Oh, and you'll need to have the FS explicitly require early tests, so that samples which fail the fragment tests will not execute.
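A minimal sketch of such a fragment shader (the counter image name and the vInstanceID varying are invented for illustration, with the instance ID forwarded from the vertex shader):

#version 430
layout(early_fragment_tests) in;     // samples failing the depth/stencil test never reach this shader
layout(r32ui, binding = 0) uniform uimageBuffer visibleCounters;
flat in int vInstanceID;             // forwarded from the vertex shader (gl_InstanceID)
out vec4 fragColor;

void main() {
    // Any non-zero counter afterwards means "this instance contributed visible samples".
    imageAtomicAdd(visibleCounters, vInstanceID, 1u);
    fragColor = vec4(0.0);           // color output is irrelevant for this counting pass
}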
Then use a compute shader to build up a list of rendering commands for the instances which have non-zero values in the array. Then use an indirect rendering call to draw the results of this computation. Now obviously, you will need to properly synchronize access between these various operations. So you'll need to use appropriate glMemoryBarrier calls between each one.
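On the application side, the pass structure might look roughly like this sketch (the program and buffer names are assumptions; one simple approach is to have the compute shader emit one indirect command per instance and set instanceCount to 0 for invisible ones, so the draw count stays fixed):

// 1. Count pass: render all instances with the counting fragment shader.
drawAllInstancesWithCountingShader();                   // hypothetical helper
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);    // make the atomic counts visible to image loads

// 2. Compute pass: turn non-zero counters into indirect draw commands.
glUseProgram(buildCommandsProgram);                     // hypothetical compute program
glDispatchCompute((numInstances + 63) / 64, 1, 1);
glMemoryBarrier(GL_COMMAND_BARRIER_BIT);                // commands must be visible to the indirect draw

// 3. Draw pass: draw only the instances marked visible.
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectCommandBuffer);
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, numInstances, 0);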
Even if queries worked the way you want them to, this approach would still be far preferable to using a query object. Unless you're reading a query into a buffer object, reading a query object requires a GPU/CPU synchronization of some form. The approach above still requires some synchronization and barrier operations, but they're all on-GPU operations rather than a synchronization with the CPU.

Related

glMultiDraw functions and gl_InstanceID

When I look at the documentation of glMultiDrawElementsIndirect (or in the Wiki) it says that a single call to glMultiDrawElementsIndirect is equivalent to repeatedly calling glDrawElementsIndirect (just with different parameters).
Does that mean that gl_InstanceID will reset for each of these "internal" calls? And if so, how am I able to tell all these calls apart in my vertex shader?
Background: I'm trying to draw all my different meshes all at once. But I need some way to know which mesh the vertex I'm processing in my vertex shader belongs to.
The documentation says "similarly to". "Equivalent" isn't the same thing. It also points to glDrawElementsInstancedBaseVertexBaseInstance, not glDrawElementsInstanced.
But yes, gl_InstanceID for any draw will start at zero, no matter what base instance you provide. That's how gl_InstanceID works, unfortunately.
Besides, that's not the question you want answered. You're not looking to ask which instance you're rendering, since each draw in the multi-draw can be rendering multiple instances. You're asking which draw in the multi-draw you are in. An instance ID isn't going to help.
And if so, how am I able to tell all these calls apart in my vertex shader?
Unless you have OpenGL 4.6 or ARB_shader_draw_parameters, you can't. Well, not directly.
That is, multidraw operations are expected to produce different results based on rendering from different parts of the current buffer objects, not based on computations in the shader. You're rendering with a different base vertex that selects different vertices from the arrays, or you're using different ranges of indices or whatever.
The typical pre-shader_draw_parameters solution would have been to use a unique base instance on each of the individual draws. Of course, since gl_InstanceID doesn't track the base instance (as previously stated), you would need to employ instanced arrays instead. So you'd get the mesh index from that.
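A sketch of that workaround (attribute location 7 and the buffer name meshIndexBuffer are invented for illustration):

// One integer per draw, fetched through an instanced attribute.
glBindBuffer(GL_ARRAY_BUFFER, meshIndexBuffer);
glVertexAttribIPointer(7, 1, GL_INT, sizeof(GLint), nullptr);
glVertexAttribDivisor(7, 1);              // advance per instance, not per vertex
glEnableVertexAttribArray(7);
// Each draw's baseInstance in the indirect command is set to that draw's index,
// so instance 0 of draw i reads element i of meshIndexBuffer,
// even though gl_InstanceID itself still starts at zero.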
Of course, 4.6/shader_draw_parameters gives you gl_DrawID, which simply tells you the index of the draw within the multidraw command. It's also dynamically uniform, so you can use it to access arrays of opaque types in shaders.
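With gl_DrawID available, the vertex shader side becomes a one-liner (a sketch; the PerDraw storage block and viewProj uniform are assumptions):

#version 460
layout(location = 0) in vec3 position;
layout(std430, binding = 0) readonly buffer PerDraw { mat4 model[]; };
uniform mat4 viewProj;

void main() {
    // gl_DrawID is the index of the current draw within the multi-draw command
    // and is dynamically uniform, so it can also index arrays of opaque types.
    gl_Position = viewProj * model[gl_DrawID] * vec4(position, 1.0);
}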

Does OpenGL allow you to provide array attributes when using glMultiDrawArrays?

I'm drawing a bunch of primitives (in my case lines) using the glMultiDrawArrays command. Each of these arrays (lines) have some additional attribute(s) specific to that array.
I would essentially like to pass these "array attributes" as separate uniforms specific to each array.
The two ways I can think of now are:
Draw each array (line) in separate draw calls and specify the attribute as a uniform.
Pass these attributes as vertex attributes. This would require me to store as many copies of the same value as I have vertices (I can have up to 100k in the arrays). Not an option if I do have to store them!
Is there a smarter way of doing this in OpenGL?
Say I have n number of primitives to draw.
The glMultiDrawArrays command already requires me to pass along two arrays of size n: one array (lineStartIndex) of start indices and one array (lineCount) storing how many vertices each array contains.
To me it seems as it should be possible to specify vertex array attributes in a similar manner. E.g. a arrayAttributes vector of size n that could also be passed along with the draw call.
So instead of vertexAttribArray I'd like something like vertexArrayAttribArray ;)
Btw, I am only using one VAO and one VBO.
To me it seems as it should be possible to specify vertex array attributes in a similar manner.
Why? Vertex attributes are for things that change per-vertex; why would you expect to be able to use them for things that don't change per-vertex?
Well, you can (as I will explain), but I don't know why you think it should be possible.
The instancing and base instance feature can be (ab)used to provide a uniform-like value to a shader in a draw call.
You need to set up the attribute you want to provide "per-line" as an instanced array attribute. So you would use glVertexAttribDivisor with a value of 1 for that attribute.
Then, for each line, you would call glDrawArraysInstancedBaseInstance. Each call represents a single line, so you would provide an instance count of 1. But the base instance represents the index in the attribute array for that line's "per-line" value.
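A sketch of that setup and loop (attribute location 3 and GL_LINE_STRIP as the primitive type are arbitrary choices; lineStartIndex and lineCount are the same arrays you already pass to glMultiDrawArrays):

// The per-line value lives in its own instanced attribute.
glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE, 0, nullptr);
glVertexAttribDivisor(3, 1);               // advance once per instance, not per vertex
glEnableVertexAttribArray(3);

for (GLsizei line = 0; line < n; ++line) {
    // instance count 1; baseInstance selects this line's entry in the attribute array
    glDrawArraysInstancedBaseInstance(GL_LINE_STRIP,
                                      lineStartIndex[line], lineCount[line],
                                      1, line);
}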
Yes, you're not using multi-draw. But OpenGL's draw call overhead is minimal; it's combining draw calls with state changes that gets you. And you're not changing state between calls, so there's no problem.
However, if you're concerned about this minor overhead, and have access to multi-draw indirect functionality, you can always use that. Each line is a separate draw in the multi-draw call.
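A C++ sketch of the indirect variant (the struct layout below is the one glMultiDrawArraysIndirect expects; indirectBuffer is an assumed pre-created buffer object):

// #include <vector>
struct DrawArraysIndirectCommand {
    GLuint count, instanceCount, first, baseInstance;
};

std::vector<DrawArraysIndirectCommand> cmds(n);
for (GLsizei line = 0; line < n; ++line)
    cmds[line] = { (GLuint)lineCount[line], 1u, (GLuint)lineStartIndex[line], (GLuint)line };

glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
glBufferData(GL_DRAW_INDIRECT_BUFFER, cmds.size() * sizeof(cmds[0]), cmds.data(), GL_STREAM_DRAW);
glMultiDrawArraysIndirect(GL_LINE_STRIP, nullptr, n, 0);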
Of course, base instance requires GL 4.2 or ARB_base_instance. If you don't have access to hardware of this nature, then you're going to have to stick with the uniform variable method.

How to send my model matrix only once per model to shaders

For reference, I'm following this tutorial. Now suppose I have a little application with multiple types of model. If I understand correctly, I have to send my MVP matrix from the CPU to the GPU (in other words to my vertex shader) for each model, because each model might have a different model matrix from one to another.
Now looking at the tutorial and this post, I understand that the call to send the matrix to my shader (glUniformMatrix4fv(myMatrixID, 1, GL_FALSE, &myModelMVP[0][0])) should be done for each frame and for each model since each time it overwrites the previous value of my MVP (the one for my last model). But, being concerned about the performance of my app, I don't want to send useless data through the bus and if I understand correctly, my model matrix is constant for each model.
I'm thinking about having a uniform for each model's MVP matrix, but I think it is not scalable and I would also have to update all of them if my view or projection matrices changed... Is there a way to avoid sending my model matrices multiple times and only send my view and projection matrices when they change?
These are essentially two questions: how to avoid sending data when only part of a transformation sequence changes, and how to efficiently supply per-model data which may or may not have changed since the last frame.
Transformation Sequence
For the first, you have a transformation sequence. Your positions are in model space. You then conceptually transform them into world space, then to camera/view space, then finally to clip space, where you write the position to gl_Position.
Most of these transformations are constant throughout a frame, but may change on a frame-to-frame basis. So you want to avoid changing data that doesn't strictly need to be changed.
If you want to do this, then clearly you cannot provide an "MVP" matrix. That is, you should not have a single matrix that contains the whole transformation. You should instead have a matrix that represents particular parts of the transformation.
However, you will need to do this decomposition for reasons other than performance. You cannot do many lighting operations in clip-space; as a non-linear space, it messes up lots of lighting operations. Therefore, if you're going to do lighting at all, you need a transformation that stops before clip space.
Camera/view space is the most common stopping point for lighting computations.
Now, if you use model-to-camera and camera-to-clip, then the model-to-camera matrix for every model will change when the camera changes, even if the model itself has not moved. And therefore, you may need to upload a bunch of matrices that don't strictly need to be changed.
To avoid that, you would need to use model-to-world and world-to-clip (in this case, you do your lighting in world space). The issue here is that you are exposed to the perils of world space: numerical precision may become problematic.
But is there a genuine performance issue here? Obviously it somewhat depends on the hardware. However, consider that many applications have hundreds if not thousands of objects, each with matrices that change every frame. An animated character usually has over a hundred matrices just for themselves that change every frame.
So it seems unlikely that the performance cost of uploading a few matrices that could have been constant is a real-world problem.
Per-Object Storage
What you really want is to separate your storage of per-object data from the program object itself. This can be done with UBOs or SSBOs; in both cases, you're storing uniform data in buffer objects.
The former are typically smaller in size (64KB or so), while the latter are essentially unbounded in their storage (16MB minimum). Obviously, the former are typically faster to access, but SSBOs shouldn't be considered to be slow.
Each object would have a section of the buffer that gets used for per-object data. And thus, you could choose to change it or not as you see fit.
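As a rough sketch of the UBO flavor (the block name PerObject, binding point 1, and the use of glm are assumptions made for illustration): each object owns an aligned slice of one large buffer, and you bind just that slice before drawing the object.

// GLSL side (sketch): layout(std140, binding = 1) uniform PerObject { mat4 model; };
GLint uboAlign = 0;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &uboAlign);
GLsizeiptr slice = ((sizeof(glm::mat4) + uboAlign - 1) / uboAlign) * uboAlign;  // per-object slice, rounded up

glBindBuffer(GL_UNIFORM_BUFFER, perObjectUBO);          // assumed pre-created buffer object
glBufferData(GL_UNIFORM_BUFFER, slice * objectCount, nullptr, GL_DYNAMIC_DRAW);

// Per frame: upload only the matrices that actually changed.
glBufferSubData(GL_UNIFORM_BUFFER, objectIndex * slice, sizeof(glm::mat4), &modelMatrix[objectIndex]);

// When drawing object objectIndex, expose only its slice to the shader:
glBindBufferRange(GL_UNIFORM_BUFFER, 1, perObjectUBO, objectIndex * slice, sizeof(glm::mat4));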
Even so, such a system does not guarantee faster performance. For example, if the implementation is still reading from that buffer from last frame when you try to change it this frame, the implementation will have to either allocate new memory or just wait until the GPU is finished. This is not a hypothetical possibility; GPU rendering for complex scenes frequently lags a frame behind the CPU.
So to avoid that, you would need to double-buffer your per-object data. But when you do that, you will have to always upload their data, even if it doesn't change. Why? Because it might have changed two frames ago, and your double buffer has old data in it.
Basically, your goal of trying to avoid uploading of sometimes-static per-model data is just as likely to harm performance as to help it.
First of all, it's likely that at least something in your scene moves. If it is the objects, then the model matrix will change from frame to frame; if it is the camera, then the view or projection matrix will change. MVP is the composition of the three, so it will change anyway and you can't get away from updating it in one way or the other.
However, you may still benefit from employing some of these:
Use Uniform Buffer Objects. You can send the uniforms to the GPU only once, and then rebind the buffer that the program will read the uniforms from. So different models may use different UBOs for their parameters (like model matrix).
Use Instancing. Even if you render only one instance of every model, you can pass the model matrix as an instanced vertex attribute (see the sketch below). It will be stored in a VBO referenced by the VAO, and so sent to the GPU only once (or only when you have to update it). On the plus side you may now easily render multiple instances of the same model through instanced draw calls.
Note it might be beneficial to separate the model, view and projection matrices. View and projection might be passed through a 'camera description' uniform buffer object updated only once per frame, then referenced by all programs. The model matrix, if it isn't changed, will then stay constant within the VAO. To do proper lighting you have to separate model-view from projection anyway. It might look intimidating to work with three matrices on the GPU, but you actually don't have to, as you may switch to a quaternion-based pipeline instead, which in turn simplifies things like tangent space interpolation.
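A sketch of the instanced-attribute idea mentioned above: a mat4 attribute consumes four consecutive locations, each of which gets a divisor of 1 (locations 4-7 and the buffer name modelMatrixVBO are arbitrary choices here).

glBindBuffer(GL_ARRAY_BUFFER, modelMatrixVBO);    // assumed buffer holding one mat4 per model/instance
for (int col = 0; col < 4; ++col) {
    glVertexAttribPointer(4 + col, 4, GL_FLOAT, GL_FALSE,
                          sizeof(float) * 16,
                          (const void*)(sizeof(float) * 4 * col));
    glVertexAttribDivisor(4 + col, 1);            // advance once per instance, not per vertex
    glEnableVertexAttribArray(4 + col);
}
// GLSL side (sketch): layout(location = 4) in mat4 model;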
Two words: Premature Optimization!
I don't want to send useless data through the bus and if I understand correctly, my model matrix is constant for each model.
The amount of data transmitted is insignificant. A 4×4 matrix of single precision floats takes up 64 bytes. For all intents and purposes this is practically nothing. Heck it takes more data to issue the actual drawing commands to the GPU (and usually uniform value changes are packed into the same bus transaction as the drawing commands).
I'm thinking about having an uniform for each model's MVP matrix
Then you're going to run out of uniforms. There are only so many uniform locations a GPU is required to support. You could of course use a uniform buffer object, but that's hardly the right application for it.

Multiple meshes or multiple objects in a single mesh?

I'm trying to load multiple objects into a vbo in opengl. If I want to be able to move these objects independently should I use a mesh for each object or should I load all the objects to a single mesh?
Also in my code I have...
loc1 = glGetAttribLocation(shaderP, "vertex_position");
Now I understand that this gets the vertex positions in my current program but if I want to load another object I load the mesh and then how can I get the vertex positions again but for only that mesh?
The answer is, as often, "it depends". Having one "mesh" (i.e. one buffer) per object is arguably "cleaner", but it is also likely slower. One buffer per object will make you bind a different buffer much more often. Tiny vertex buffer objects (a few dozen vertices) are as bad as huge ones (gigabytes of data). You should try to find a reasonable middle ground.
glDrawElementsBaseVertex, readily available as of version 3.2 (an instanced version also exists), will allow you to seamlessly draw several objects or pieces from one buffer without having to fiddle with renumbering indices, and without having to switch buffers.
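For instance, with two meshes packed back to back into one shared vertex and index buffer, drawing the second one could look like this sketch (indexCountB, firstIndexB, and baseVertexB are assumed bookkeeping values recorded when the buffers were filled):

// Mesh B's indices were written as if its vertices started at 0;
// the basevertex parameter shifts them to where mesh B really sits in the shared VBO.
glDrawElementsBaseVertex(GL_TRIANGLES, indexCountB, GL_UNSIGNED_INT,
                         (const void*)(firstIndexB * sizeof(GLuint)),
                         baseVertexB);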
You should preferably (presuming OpenGL 3.3 availability) not use glGetAttribLocation at all, but assign the attribute to a location using the layout specifier. That way you know the location, you don't need to ask every time, and you don't have to worry that "weird, unexpected stuff" might happen.
If you can't do it in the shader, use glBindAttribLocation (available since version 2.0) instead. It is somewhat less comfortable, but does the same thing. You get to decide the location instead of asking for it and worrying that the compiler hopefully didn't change the order for two different shaders.
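Both variants in a nutshell (a sketch, reusing the shaderP handle from the question and picking location 0 arbitrarily):

// In the shader (GL 3.3+): the location is fixed at compile time.
//   layout(location = 0) in vec3 vertex_position;
// Or from the application (GL 2.0+), before linking:
glBindAttribLocation(shaderP, 0, "vertex_position");
glLinkProgram(shaderP);
// Either way, glVertexAttribPointer(0, ...) now targets vertex_position
// for every mesh, and glGetAttribLocation is no longer needed.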
It's generally cleaner if you use different buffers for different objects.
According to:
http://www.opengl.org/sdk/docs/man/xhtml/glGetAttribLocation.xml
This will only return the location of the attribute within the program. You use this to bind your vertex info to the program. When you render your other object you will bind the VBO that stores the other vertices.

Why does barrier synchronize shared memory when memoryBarrier doesn't?

The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.
In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.
Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces the following result:
which is the same result as having no synchronization or using memoryBarrier() instead.
If I use barrier(), I get the following (desired) result:
The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.
What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?
#version 430
#define SIZE 64
layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;
layout(rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;
shared vec4 shared_data[SIZE];
void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x, 0);
    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i, 0));
        }
    }
    // with no synchronization: stripes
    // memoryBarrier();       // stripes
    // memoryBarrierShared(); // stripes
    // barrier();             // works
    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}
The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes the data of its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more so to compute shaders, which don't have a dedicated output and only output things by writing data into writable stores, like images, storage buffers or atomic counters. This may require manual synchronization between individual passes; otherwise a fragment shader trying to access a texture might not see the most recent data written into that texture with image store operations by a preceding pass, like your compute shader.
So it may be that your compute shader works perfectly, but it is the synchronization with the following display (or whatever) pass (that needs to read this image data somehow) that fails. For this purpose there exists the glMemoryBarrier function. Depending on how you read that image data in the display pass (or more precisely the pass that reads the image after the compute shader pass), you need to give a different flag to this function. If you read it using a texture, use GL_TEXTURE_FETCH_BARRIER_BIT; if you use an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if using glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...
Though I don't have much experience with image load/store and manual memory synchronization, and this is only what I came up with theoretically. So if anyone knows better, or you already use a proper glMemoryBarrier, then feel free to correct me. Likewise, this need not be your only error (if any). But the last two points from the linked Wiki article actually address your use case and IMHO make it clear that you need some kind of glMemoryBarrier:
Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.
Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.
EDIT: Actually the Wiki article on compute shaders says
Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.

Shared variables are all implicitly declared coherent, so you don't need to (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.

The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.

While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).

To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.
So this actually sounds like you need the barrier there, and memoryBarrierShared is not enough (though you don't need both, as the last sentence says). The memory barrier will just synchronize the memory, but it doesn't stop the threads from crossing it. Thus the threads won't read any old cached data from the shared memory if the first thread has already written something, but they can very well reach the point of reading before the first thread has tried to write anything at all.
This actually fits perfectly with the fact that it works for block sizes of 32 and below, and that the first 32 pixels work. At least on NVIDIA hardware, 32 is the warp size and thus the number of threads that operate in perfect lock-step. So the first 32 threads (well, every block of 32 threads) always work exactly in parallel (well, conceptually that is) and thus they cannot introduce any race conditions. This is also why you don't actually need any synchronization if you know you work inside a single warp, a common optimization.
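Concretely, for the shader above, the fix is simply to use the execution barrier (the line already shown commented out) before the read from shared memory:

    // ... the single invocation fills shared_data ...
    barrier();   // all invocations wait here; past it, the shared writes are visible
    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);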