I am trying to control the behavior of the fragment shader by counting vertices in the geometry shader, so that if I have a vertex stream of 1000 triangles, when the count reaches 500 I set a varying for the fragment shader which signals that it must switch its processing. To count the total vertices (or triangles) processed I use an atomic counter in the geometry shader. I planned to do it in the vertex shader first, but then I read somewhere that because of vertex caching the counter won't increment on each vertex invocation. But now it seems that doing it in the geometry shader doesn't produce a precise count either.
In my geometry shader I am doing this:
#version 420
layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;
layout(binding = 0, offset = 0) uniform atomic_uint ac;

uniform uint exteriorSize;   // number of vertices in the exterior array

out flat float isExterior;

void main()
{
    memoryBarrier();
    uint counter = atomicCounter(ac);
    float switcher = 0.0;
    if (counter >= exteriorSize)
    {
        switcher = 2.0;
    }
    else
    {
        // one increment per vertex of this triangle
        atomicCounterIncrement(ac);
        atomicCounterIncrement(ac);
        atomicCounterIncrement(ac);
    }
    isExterior = switcher;
    // here just emitting the primitive....
}
exteriorSize is a uniform holding a number equal to the number of vertices in an array. When I read back the value of the counter on the CPU it never equals exteriorSize; it is almost 2 times smaller. Is there vertex caching in the geometry stage as well? Or am I doing something wrong?
Basically what I need is to tell the fragment shader: "after vertex number X start doing work Y. As long as the vertex number is less than X, do work Z." And I can't get that exact X from the atomic counter even though I increment it until it reaches that limit.
UPDATE:
I suspect the problem is with atomic write synchronization. If I put memoryBarrier in different places the counter values change, but I still can't get it to return a value that equals exteriorSize.
UPDATE 2:
Well, I didn't figure out the issue with atomic counter synchronization, so I did it using indirect draw instead. Works like a charm.
The geometry shader executes per primitive (per triangle in this case), whereas the vertex shader executes per vertex, more or less. Using glDrawElements allows vertex results to be shared between triangles (e.g. indexing 0,1,2 then 0,2,3 uses vertices 0 and 2 twice: 4 vertices, 2 triangles and 6 references). As you say, a limited cache is used to share the results, so if the same vertex is referenced a long time later it has to be recomputed.
It looks like there's a potential issue with updates to the counter occurring between atomicCounter and atomicCounterIncrement. If you want an entire section of code like this to work, it needs to be locked. This can get very slow depending on what you're locking.
Instead, it's going to be far easier to always call atomicCounterIncrement and potentially allow ac to grow beyond exteriorSize.
AFAIK reading back values from the atomic counter buffer should stall until the memory operations have completed, but I've been caught out not calling glMemoryBarrier between passes before.
It sounds like exteriorSize should be equal to the number of triangles, not vertices, if this is executing in the geometry shader. If instead you do want per-vertex processing, then maybe change to GL_POINTS, or save the vertex shader results using transform feedback and then draw triangles from that (essentially doing the caching yourself, but with a buffer that holds everything). If you use glDrawArrays or never reuse vertices then a standard vertex shader should be fine.
Lastly, calling atomicCounterIncrement three times is a waste. Call once and use counter * 3.
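For example, a minimal sketch of that simplified geometry shader might look like this (exteriorSize is still the vertex count from the question; the emission code at the end is just placeholder pass-through):

#version 420
layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;
layout(binding = 0, offset = 0) uniform atomic_uint ac;

uniform uint exteriorSize;   // vertex count, as in the question

out flat float isExterior;

void main()
{
    // One increment per primitive; the return value is this primitive's
    // unique pre-increment index, so no separate atomicCounter() read
    // (and no read/increment race) is needed.
    uint primitiveIndex = atomicCounterIncrement(ac);

    // Let the counter grow past the limit and just compare index * 3
    // against the vertex count.
    isExterior = (primitiveIndex * 3u >= exteriorSize) ? 2.0 : 0.0;

    for (int i = 0; i < 3; ++i)
    {
        gl_Position = gl_in[i].gl_Position;
        EmitVertex();
    }
    EndPrimitive();
}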
So I want to have multiple light sources in my scene. The basic idea is to simply have an array of a (uniform) struct that has all the properties of a light you care about, such as position, color, direction, cutoff and whatever else you want. My problem is how to represent which lights are on/off. I will list out all the ways I can think of.
Have a uniform int per light structure to indicate if it's on/off.
Have the number of light struct match multiples of 2, 3, or 4 such that I can use that many bool vectors to indicate their status. For example, 16 lights = 4x4 bvec4.
Instead of using many flags and branch, always go through every single light but with the off ones set to (0,0,0,0) for color
I'm leaning towards the last option as it won't have branching... but I've already read that modern graphics cards are more okay with branching now.
None of your ideas is really good, because all of them require the shader to evaluate which light sources are active and which aren't. You should also differentiate between scene information (which lights are present in the scene) and the data necessary for rendering (which lights are on and illuminate the scene). Scene information shouldn't be stored in a shader, since it is unnecessary there and will only slow the shader down.
A better way for handling multiple light sources in a scene and render with only the active ones could be as follows:
In every frame:
Evaluate on the CPU side which lights are on. Pass only the lights which are on to the shader uniforms, together with the total count of the active lights. In pseudocode:
Shader:
uniform lightsources active_lightsources[MAX_LIGHTS];
uniform int light_count;
CPU:
i = 0
foreach (light in lightsources)
{
    if (light.state == ON)
    {
        // build the uniform name with the running index,
        // not the literal string "i"
        set_uniform("active_lightsources[" + i + "]", light);
        i++
    }
}
set_uniform("light_count", i);
When illuminating a fragment do the following in the shader
Shader:
for (int i = 0; i < light_count; i++)
{
//Calculate illumination from active_lightsources[i]
}
The major advantage of this approach is that you store fewer lights inside the shader and that the loop inside the shader only runs over light sources that are relevant for the current frame. The evaluation of which lights are relevant is done once per frame on the CPU instead of once per vertex/fragment in the shader.
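As a rough GLSL sketch of the shader side (the LightSource members and the diffuse-only shading are placeholders for whatever lighting model you actually use):

#define MAX_LIGHTS 16

struct LightSource
{
    vec3  position;
    vec3  color;
    vec3  direction;
    float cutoff;
};

uniform LightSource active_lightsources[MAX_LIGHTS];
uniform int         light_count;

vec3 shade(vec3 fragPos, vec3 normal, vec3 albedo)
{
    vec3 result = vec3(0.0);
    // Only the first light_count entries are valid; they were filled on
    // the CPU with the lights that are currently switched on.
    for (int i = 0; i < light_count; i++)
    {
        vec3 L = normalize(active_lightsources[i].position - fragPos);
        float diff = max(dot(normal, L), 0.0);
        result += diff * active_lightsources[i].color * albedo;
    }
    return result;
}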
I send a VertexBuffer+IndexBuffer of GL_TRIANGLES via glDrawElements() to the GPU.
In the vertex shader I wanted to snap some vertices to the same coordinates to simplify a large mesh on the fly.
As a result I expected a major performance boost, because a lot of triangles would collapse to the same point and become degenerate.
But I don't get any fps gain.
For testing I set my vertex shader to just output gl_Position = vec4(0) to degenerate ALL triangles, but still no difference...
Is there any flag to "activate" the degeneration, or what am I missing?
A glQuery of GL_PRIMITIVES_GENERATED also always reports the number of all mesh faces.
What you're missing is how the optimization you're trying to use actually works.
The particular optimization you're talking about is the post-transform (post-T&L) cache. That is, if the same vertex is going to get processed twice, you only process it once and use the results twice.
What you don't understand is how "the same vertex" is actually determined. It isn't determined by anything your vertex shader could compute. Why? Well, the whole point of caching is to avoid running the vertex shader. If the vertex shader was used to determine if the value was already cached... you've saved nothing since you had to recompute it to determine that.
"The same vertex" is actually determined by matching the vertex index and vertex instance. Each vertex in the vertex array has a unique index associated with it. If you use the same index twice (only possible with indexed rendering of course), then the vertex shader would receive the same input data. And therefore, it would produce the same output data. So you can use the cached output data.
Instance ids also play into this, since when doing instanced rendering, the same vertex index does not necessarily mean the same inputs to the VS. But even then, if you get the same vertex index and the same instance id, then you would get the same VS inputs, and therefore the same VS outputs. So within an instance, the same vertex index represents the same value.
Both the instance ID and the vertex index are part of the rendering process. They don't come from anything the vertex shader can compute. The vertex shader could generate the same positions, normals, or anything else, but the actual post-transform cache is keyed on the vertex index and instance.
So if you want to "snap some vertices to the same coordinates to simplify a large mesh", you have to do that before your rendering command. If you want to do it "on the fly" in a shader, then you're going to need some kind of compute shader or geometry shader/transform feedback process that will compute the new mesh. Then you need to render this new mesh.
You can discard a primitive in a geometry shader. But you still had to do T&L on it. Plus, using a GS at all slows things down, so I highly doubt you'll gain much performance by doing this.
I'm rendering a line that is composed of triangles in OpenGL.
Right now I have it working where:
Vertex buffer: {v0, v1, v2, v3}
Index buffer (triangle strip): {0, 1, 2, 3}
The top image is the raw data passed into the vertex shader and the bottom is the vertex shader output after applying an offset to v1 and v3 (using a vertex attribute).
My goal is to use one vertex per point on the line and generate the offset some other way. I was looking at gl_VertexID, but I want something more like an element ID. Here's my desired setup:
Vertex buffer: {v0, v2}
Index buffer (triangle strip): {0, 0, 1, 1}
and use an imaginary gl_ElementID % 2 to offset every other vertex.
I'm trying to avoid using geometry shaders or additional vertex attributes. Is there any way of doing this? I'm open to completely different ideas.
I can think of one way to avoid the geometry shader and still work with a compact representation: instanced rendering. Just draw many instances of one quad (as a triangle strip), and define the two positions as per-instance attributes via glVertexAttribDivisor().
Note that you don't need a "template quad" with 4 vertices at all. You just need conceptually two attributes, one for your start point, and one for your end point. (If you work in 2D, you can fuse that into one vec4, of course). In each vertex shader invocation, you will have access to both points, and can construct the final vertex position based on that and the value of gl_VertexID (which will only be in range 0 to 3). That way, you can get away with exactly that vertex array layout of two points per line segment you are aiming for, and still only need a single draw call and a vertex shader.
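A minimal vertex-shader sketch of that idea (the attribute names, the thickness handling and the assumption of one instance per line segment drawn as a 4-vertex triangle strip via glDrawArraysInstanced are all illustrative):

#version 330 core

// Per-instance attributes: set glVertexAttribDivisor(location, 1) for both.
layout(location = 0) in vec2 segmentStart;
layout(location = 1) in vec2 segmentEnd;

uniform mat4  uMVP;
uniform float uHalfWidth;

void main()
{
    // gl_VertexID runs 0..3 within each instance's triangle strip.
    bool isEnd   = (gl_VertexID >= 2);       // vertices 2 and 3 belong to the end point
    bool isUpper = (gl_VertexID & 1) == 1;   // odd vertices get the positive offset

    vec2 p   = isEnd ? segmentEnd : segmentStart;
    vec2 dir = normalize(segmentEnd - segmentStart);
    vec2 n   = vec2(-dir.y, dir.x);          // perpendicular, used for thickness

    vec2 offset = (isUpper ? uHalfWidth : -uHalfWidth) * n;
    gl_Position = uMVP * vec4(p + offset, 0.0, 1.0);
}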
No, that is not possible, because each vertex is only processed once. So if you're referencing a vertex 10 times with an index buffer, the corresponding vertex shader is still only executed one time.
This is implemented in hardware with the Post Transform Cache.
In the absolute best case, you never have to process the same vertex more than once.

The test for whether a vertex is the same as a previous one is somewhat indirect. It would be impractical to test all of the user-defined attributes for inequality. So instead, a different means is used.

Two vertices are considered equal (within a single rendering command) if the vertex's index and instance count are the same (gl_VertexID and gl_InstanceID in the shader). Since vertices for non-indexed rendering are always increasing, it is not possible to use the post transform cache with non-indexed rendering.

If the vertex is in the post transform cache, then that vertex data is not necessarily even read from the input vertex arrays again. The process skips the read and vertex shader execution steps, and simply adds another copy of that vertex's post-transform data to the output stream.
To solve your problem I would use a geometry shader with a line (or line strip) as input and a triangle strip as output. With this setup you could get rid of the index buffer, since it's only working on lines.
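A rough sketch of such a geometry shader (the half-width uniform and the simple 2D offset are simplifications; a real line renderer would usually expand in screen space after projection):

#version 330 core
layout(lines) in;
layout(triangle_strip, max_vertices = 4) out;

uniform float uHalfWidth;   // assumed line thickness, in the same space as the positions

void main()
{
    vec2 p0 = gl_in[0].gl_Position.xy;
    vec2 p1 = gl_in[1].gl_Position.xy;

    vec2 dir = normalize(p1 - p0);
    vec2 n   = vec2(-dir.y, dir.x) * uHalfWidth;   // perpendicular offset

    // Emit the four corners of the quad as one triangle strip.
    gl_Position = vec4(p0 - n, gl_in[0].gl_Position.zw); EmitVertex();
    gl_Position = vec4(p0 + n, gl_in[0].gl_Position.zw); EmitVertex();
    gl_Position = vec4(p1 - n, gl_in[1].gl_Position.zw); EmitVertex();
    gl_Position = vec4(p1 + n, gl_in[1].gl_Position.zw); EmitVertex();
    EndPrimitive();
}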
I am trying to make a particle system where instead of a texture, a quad is rendered by a fragment shader such as below.
uniform vec3 color;
uniform float radius;
uniform float edge;
uniform vec2 position;
uniform float alpha;
void main()
{
    float dist = distance(gl_FragCoord.xy, position);
    float intensity = smoothstep(dist - edge, dist + edge, radius);
    gl_FragColor = vec4(color, intensity * alpha);
}
Each particle is an object of a C++ class that wraps this shader and all the variables together and draws it. I use openFrameworks, so the exact OpenGL calls are hidden from me.
I've read that particle systems are usually done with textures; however, I prefer to do it like this because this way I can add more functionality to the particles. The problem is that after only 30 particles, the framerate drops dramatically. Is there a more efficient way of doing this? I was thinking of maybe putting the variables for each particle into arrays and sending these arrays into one fragment shader that then renders all particles in one go. But this would mean that the number of particles would be fixed, because the uniform arrays in the shader would have to be declared beforehand.
Are non-texture-based particle systems simply too inefficient to be realistic, or is there a way of designing this that I'm overlooking?
The reason textures are used is because you can move the particles using the GPU, which is very fast. You'd double buffer a texture which stores particle attributes (like position) per texel, and ping-pong data between them, using a framebuffer object to draw to them and the fragment shader to do the computation, rendering a full screen polygon. Then you'd draw an array of quads and read the texture to get the positions.
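For illustration, the update pass under that scheme could be a fragment shader along these lines (the texture layout, the names and the simple Euler integration are all assumptions, not part of the original answer):

#version 330 core

uniform sampler2D uPrevState;   // previous frame: xy = position, zw = velocity
uniform float     uDeltaTime;

in  vec2 vTexCoord;             // one texel per particle, from a full-screen pass
out vec4 fragState;             // written into the other texture of the ping-pong pair

void main()
{
    vec4 state = texture(uPrevState, vTexCoord);
    vec2 pos = state.xy;
    vec2 vel = state.zw;

    vel += vec2(0.0, -9.8) * uDeltaTime;   // simple gravity
    pos += vel * uDeltaTime;

    fragState = vec4(pos, vel);
}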
Instead of a texture storing attributes you could pack them directly into your VBO data. This gets complicated because you have multiple vertices per particle, but can still be done a number of ways. glVertexBindingDivisor (requires instancing), drawing points, or using the geometry shader come to mind. Transform feedback or image_load_store could be used to update VBOs with the GPU instead of textures.
If you move particles with the CPU, you also need to copy the data to the GPU every frame. This is quite slow, but nothing like 30 particles being a problem slow. This is probably to do with the number of draw calls. Each time you draw something there's a tonne of stuff GL does to set up the operation. Setting uniform values per-primitive (nearly) is very expensive for the same reason. Particles work well if you have arrays of data that gets processed by a manager all at once. They parallelize very well in such cases. Generally their computation is cheap and it all comes down to minimizing memory and keeping good locality.
If you want to keep particle updating CPU side, I'd go with this:
Create a VBO full of -1 to 1 quads (two triangles, 6 verts) and an element array buffer to draw them. This buffer will remain static in GPU memory and is what you use to draw all the particles at once with a single draw call.
Create a texture (could be 1D) or VBO (if you choose one of the above methods) that contains positions and particle attributes that update pretty much every frame (using glTexImage1D/glBufferData/glMapBuffer).
Create another texture with particle attributes that rarely update (e.g. only when you spawn them). You can send updates with glTexSubImage1D/glBufferSubData/glMapBufferRange.
When you draw the particles, read position and other attributes from the texture (or attributes if you used VBOs) and use the -1 to 1 quads in the main geometry VBO as offsets to your position.
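As a sketch of that last step, assuming the instanced-attribute variant (per-instance center/radius/color with a divisor of 1, per-vertex corners from the static -1 to 1 quad; all names here are placeholders):

#version 330 core

// Per-vertex: corner of the static -1 to 1 quad.
layout(location = 0) in vec2 aCorner;

// Per-instance (glVertexAttribDivisor = 1): one entry per particle, updated each frame.
layout(location = 1) in vec2  aCenter;
layout(location = 2) in float aRadius;
layout(location = 3) in vec4  aColor;

uniform mat4 uProjection;

out vec4 vColor;
out vec2 vLocal;   // -1..1 position inside the quad, for the circle falloff in the fragment shader

void main()
{
    vLocal = aCorner;
    vColor = aColor;
    gl_Position = uProjection * vec4(aCenter + aCorner * aRadius, 0.0, 1.0);
}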
Which is the best way, from a performance point of view, to perform a (weighted) sum of the contents of two textures? I'm fine with doing this on either the CPU or the GPU, as long as it is a fast method. Note that this must be repeated multiple times, so it's not just a one-shot sum of two textures.
In particular I'm interested in a weighted sum of several textures, but I believe this can be easily generalized from the sum of two.
EDIT:
I'll make my goal more clear. I have to generate several textures (sequentially) with various amounts of blurring, so these textures will all be generated by render-to-texture. I don't think there will ever be more than 8 or 9 of them.
At the end the result must be displayed on screen.
So if I understand the question correctly, you render into some textures, then need a weighted sum over all of those textures, and want to display just that image. If so, you could just do one extra rendering pass with all of your textures bound, and calculate the weighted sum of all textures in the fragment shader. Since you do not need the result as a texture, you could render directly into the default framebuffer, so the result becomes immediately visible.
With at most 9 textures, you can actually follow that strategy, since there will be enough texture units. However, that approach might be a bit inflexible, especially if you have to deal with a varying number of textures to sum up at different points in time.
It would be nice if you could just have a uniform variable with the count, and array of weight values, and a loop in the shader which would boil down to
uniform int count;
uniform float weights[MAX_COUNT];
uniform sampler2D uTex[MAX_COUNT];
[...]
for (int i = 0; i < count; i++)
    sum += weights[i] * texture(uTex[i], texcoords);
And you can do that beginning with GL 4. It does support arrays of texture samplers, but requires that the access index is dynamically uniform, which means that all shader invocations are going to access the same texture samplers at the same time. As the loop only depends on a uniform variable, this is the case.
However, it might be a better strategy to just not use multiple textures. Assuming all of your input textures have the same resolution, you might be better off using just one array texture. You can attach a layer of such an array texture to an FBO just as you can with an ordinary 2D texture, so rendering to the layers independently (or rendering to multiple layers at a time using multiple render targets) will just work. You then only need to bind that single array texture and can do
uniform int count;
uniform float weights[MAX_COUNT];
uniform sampler2DArray uTex;
[...]
for (int i = 0; i < count; i++)
    sum += weights[i] * texture(uTex, vec3(texcoords, i));
This only requires GL3-level hardware, and the maximum count you can work with is not limited by the number of texture units available to the fragment shader, but by the texture array layer limit (typically > 256) and the available memory. However, performance will go down if count gets too high. You might reach a point where using multiple passes, each summing only a certain sub-range of your images, becomes more efficient due to the texture cache: the texture accesses of the different layers all compete for the texture cache, negatively impacting the cache hit rate between neighboring fragments. But this should be no issue with just 8 or 9 input images.
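Putting that together, a complete fragment shader for the array-texture variant could look roughly like this (the output variable and the MAX_COUNT value are placeholders):

#version 330 core

#define MAX_COUNT 16

uniform int            count;
uniform float          weights[MAX_COUNT];
uniform sampler2DArray uTex;

in  vec2 texcoords;
out vec4 fragColor;

void main()
{
    vec4 sum = vec4(0.0);
    // Layer i of the array texture holds the i-th blurred image.
    for (int i = 0; i < count; i++)
        sum += weights[i] * texture(uTex, vec3(texcoords, float(i)));
    fragColor = sum;
}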