I ran a test and got a weird performance result with C++, OpenGL, and GLSL.
In the first program I rendered the pixels to a texture with a fragment shader and then drew that texture to the screen.
The texture's mag/min filters were GL_NEAREST.
In the second program I used the same fragment shader but rendered directly to the screen.
Why is the second program faster? Isn't rendering a texture faster than repeating the same computation?
It's like recording a video of a AAA game and then playing it back on the same computer, only to get a lower FPS from the video.
The fragment shader is:
uniform int mx, my;

void main(void) {
    // Offset from the point (mx, my), then take the componentwise absolute value.
    vec2 p = gl_FragCoord.xy;
    p -= vec2(mx, my);
    if (p.x < 0.0)
        p.x = -p.x;
    if (p.y < 0.0)
        p.y = -p.y;
    // Distance from the point, plus an extra term, used to scale the offset.
    float dis = sqrt(p.x * p.x + p.y * p.y);
    dis += (abs(p.x) + abs(p.y)) - (abs(p.x) - abs(p.y));
    p.x /= dis;
    p.y /= dis;
    gl_FragColor = vec4(p.x, p.y, 0.0, 1.0);
}
As usual with performance questions, about the only way to be really certain would be to use a profiler.
That said, my guess would be that this is mostly a question of processing bandwidth versus memory bandwidth. To render a texture, the processor has to read data from one part of memory, and write that same data back to another part of memory.
To directly render from the shader, the processor only has to write the output to memory, but doesn't have to read data in from memory.
Therefore, it's a question of which is faster: reading that particular data from memory, or generating it with the processing units? The math in your shader is pretty simple (essentially the only part that's at all complex is the sqrt) -- so at least with your particular hardware, it appears that it's a little faster to compute the result than read it from memory (at least given the other memory accesses going on at the same time, etc.)
Note that the two (shader vs. texture) have quite different characteristics. Reading a texture is going to be nearly constant speed, regardless of how simple or complex the computation that created it was. Not to state the obvious, but a shader is going to run fast if the computation is simple, and slow down (potentially a lot) as the computation gets complex. In the AAA games you mention, it's fair to guess that at least some of the shaders use calculations complex enough that they'll almost certainly be slower than a texture read. At the opposite extreme, a really trivial shader (e.g., one that just passes the fragment color through from input to output) is probably quite a lot faster than reading from a texture.
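To make the two extremes concrete, here are two separate illustrative fragment shaders (old-style GLSL to match the code in the question; neither is taken from it):

// Variant 1: trivial pass-through shader -- almost no ALU work, no memory read.
void main(void) {
    gl_FragColor = gl_Color;
}

// Variant 2: texture-read shader -- one memory fetch per fragment.
uniform sampler2D tex;
void main(void) {
    gl_FragColor = texture2D(tex, gl_TexCoord[0].xy);
}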
Related
I'm using a color palette of 5 colors in my game, and every time I render I pass every single color as a separate uniform vec3 to the program. Would it be more efficient if I used a one-dimensional texture that contains all 5 colors (15 floats in the texture)?
That's just one of the situations where I would like to do this kind of thing. Another would be sending all the matrices/variables to the shader program at once. It seems a little inefficient to send every variable one at a time, every time I want to render. Would it be better to group them all in a single texture and send them all at once?
Is there maybe another, even more efficient way of doing what I'm trying to do?
Would it be more efficient if I used a one-dimensional texture that contains all 5 colors (15 floats in the texture)?
No; texture reads will most likely be slower, though in your case it shouldn't make much difference.
Is there maybe another, even more efficient way of doing what I'm trying to do?
Well, if they are always the same, as you say, then just put them in the shaders as constants.
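For instance, a sketch of what that could look like (the palette values and the colorIndex uniform are made up):

#version 330 core

// Hypothetical 5-color palette baked into the shader as constants.
const vec3 palette[5] = vec3[5](
    vec3(0.10, 0.10, 0.10),
    vec3(0.85, 0.30, 0.25),
    vec3(0.25, 0.65, 0.35),
    vec3(0.20, 0.40, 0.80),
    vec3(0.95, 0.90, 0.60)
);

uniform int colorIndex;   // which palette entry this draw uses
out vec4 fragColor;

void main() {
    fragColor = vec4(palette[colorIndex], 1.0);
}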
That's just one of the situations where I would like to do this kind of thing. Another would be sending all the matrices/variables to the shader program at once. It seems a little inefficient to send every variable one at a time, every time I want to render. Would it be better to group them all in a single texture and send them all at once?
Modifying textures is not going to be any faster than setting a few uniforms.
And if you are going to use your matrices per-vertex or per-fragment, then it will cause a lot of texture reads. That might actually cause a significant performance drop, depending on the number of vertices/fragments/matrices you have. And even if that texture data ends up in the L1 texture cache, it still won't outperform uniform reads.
If you don't want to send all the variables independently, you could use Uniform Buffer Objects.
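For example, a hypothetical per-frame block might look like the following on the GLSL side (the block and member names are made up); on the C++ side you would fill one buffer object and attach it to the block's binding point with glBindBufferBase, instead of issuing a separate glUniform* call per value.

#version 330

// Hypothetical uniform block. std140 gives a fixed, well-defined layout,
// so the C++ side can upload everything in one buffer update.
layout(std140) uniform PerFrame {
    mat4 projection;
    mat4 view;
    vec4 palette[5];   // vec4 rather than vec3 sidesteps std140 padding surprises
};

in vec3 position;

void main() {
    gl_Position = projection * view * vec4(position, 1.0);
}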
In my application I have some shaders which write only to the depth buffer, to use it later for shadowing. I also have some other shaders which render a fullscreen quad whose depth should not affect any subsequent draw calls, so its depth values may be thrown away.
Assuming the application runs on modern hardware (produced within the last five years), will I gain any additional performance if I disable color buffer writes (glColorMask with everything set to GL_FALSE) for the shadow map shaders, and depth buffer writes (with glDepthMask()) for the fullscreen quad shaders?
In other words, do these functions really disable some memory operations, or do they just alter some mask bits used in fixed bitwise logic in this part of the rendering pipeline?
And the same question about testing: if I know beforehand that all fragments will pass the depth test, will disabling the depth test improve performance?
My FPS measurements don't show any significant difference, but the result may be different on another machine.
Finally, if rendering runs faster with depth/color testing/writing disabled, how much faster does it run? Wouldn't the performance gain be negated by the overhead of the extra GL function calls?
Your question is missing a very important thing: you have to do something.
Every fragment has color and depth values. Even if your FS doesn't generate a value, there will still be a value there. Therefore, every fragment produced that is not discarded will write these values, so long as:
1. The color is routed to a color buffer via glDrawBuffers.
2. There is an appropriate color/depth buffer attached to the FBO.
3. The color/depth write mask allows it to be written.
So if you're rendering and you don't want to write one of those colors or the depth value, you've got to change one of these. Changing #1 or #2 is an FBO state change, which is among the most heavyweight operations you can do in OpenGL. Therefore, your choices are to make an FBO change or to change the write mask. The latter will always be the more performance-friendly operation.
Maybe in your case, your application doesn't stress the GPU or CPU enough for such a change to matter. But in general, changing write masks is a better idea than playing with the FBO.
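As a concrete sketch (the two draw functions are hypothetical stand-ins for your own passes), the mask changes simply wrap the relevant draw calls:

// Shadow pass: depth only, so switch off all color writes.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
drawShadowCasters();   // hypothetical function for your shadow-map draws
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

// Fullscreen quad whose depth must not affect later draws:
// keep depth testing as needed, but don't write depth.
glDepthMask(GL_FALSE);
drawFullscreenQuad();  // hypothetical function for the quad pass
glDepthMask(GL_TRUE);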
If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
Are you changing other state at the same time, or is that the only state you're interested in?
One good way to look at these kinds of a priori performance questions is to look at Vulkan or D3D12 and see what it would require in that API. Changing any pipeline state there is a big deal. But changing two pieces of state is no bigger of a deal than one.
So if changing the depth test correlates with changing other state (blend modes, shaders, etc), it's probably not going to hurt any more.
At the same time, if you really care enough about performance for this sort of thing to matter, you should do application testing. And that should happen after you implement this, and across all hardware of interest. And your code should be flexible enough to easily switch from one to the other as needed.
I was recently looking at ray tracing tutorials for OpenGL. Most tutorials prefer compute shaders. I wonder why they don't just render to a texture, then render the texture to the screen as a quad.
What are the advantages and disadvantages of the compute shader method over a screen quad?
Short answer: because compute shaders give you more effective tools to perform complex computations.
Long answer:
Perhaps the biggest advantage that they afford (in the case of tracing) is the ability to control exactly how work is executed on the GPU. This is important when you're tracing a complex scene. If your scene is trivial (e.g., Cornell Box), then the difference is negligible. Trace some spheres in your fragment shader all day long. Check http://shadertoy.com/ to witness the madness that can be achieved with modern GPUs and fragment shaders.
But. If your scene and shading are quite complex, you need to control how work is done. Rendering a quad and doing the tracing in a frag shader is going to, at best, make your application hang while the driver cries, changes its legal name, and moves to the other side of the world...and at worst, crash the driver. Many drivers will abort if a single operation takes too long (which virtually never happens under standard usage, but will happen awfully quickly when you start trying to trace 1M poly scenes).
So you're doing too much work in the frag shader... next logical thought? OK, limit the workload. Draw smaller quads to control how much of the screen you're tracing at once. Or use glScissor. Make the workload smaller and smaller until your driver can handle it.
Guess what we've just re-invented? Compute shader work groups! Work groups are compute shader's mechanism for controlling job size, and they're a far better abstraction for doing so than fragment-level hackery (when we're dealing with this kind of complex task). Now we can very naturally control how many rays we dispatch, and we can do so without being tightly-coupled to screen-space. For a simple tracer, that adds unnecessary complexity. For a 'real' one, it means that we can easily do sub-pixel raycasting on a jittered grid for AA, huge numbers of raycasts per pixel for pathtracing if we so desire, etc.
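As a rough sketch (not taken from any particular tutorial), the skeleton of such a compute pass looks like this; the 8x8 local size is the work-group size you tune, and the imageStore at the end is the scatter write mentioned below. On the host side you would bind the output texture with glBindImageTexture and launch with glDispatchCompute, picking the group counts from the image size.

#version 430

// One invocation per pixel; the 8x8 work-group size is something you experiment with.
layout(local_size_x = 8, local_size_y = 8) in;
layout(rgba8, binding = 0) writeonly uniform image2D outImage;

void main() {
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    ivec2 size  = imageSize(outImage);
    if (pixel.x >= size.x || pixel.y >= size.y)
        return;

    // A real tracer would generate and trace a ray for this pixel here;
    // this placeholder just writes a gradient so the skeleton runs.
    vec4 color = vec4(vec2(pixel) / vec2(size), 0.0, 1.0);

    // Scatter write: we can store to any image location, not just "our" fragment.
    imageStore(outImage, pixel, color);
}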
Other features of compute shaders that are useful for performant, industrial-strength tracers:
Shared memory within a work group (allows, for example, packet tracing, in which an entire packet of spatially coherent rays is traced at the same time to exploit memory coherence and the ability to communicate with nearby rays)
Scatter Writes allow compute shaders to write to arbitrary image locations (note: image and texture are different in subtle ways, but the advantage remains relevant); you no longer have to trace directly from a known pixel location
In general, the architecture of modern GPUs is designed to support this kind of task more naturally using compute. Personally, I have written a real-time progressive path tracer using MLT, kd-tree acceleration, and a number of other computationally expensive techniques (PT is already extremely expensive). I tried to remain in a fragment shader / full-screen quad as long as I could. Once my scene was complex enough to require an acceleration structure, my driver started choking no matter what hackery I pulled. I re-implemented in CUDA (not quite the same as compute, but leveraging the same fundamental GPU architectural advances), and all was well with the world.
If you really want to dig in, have a glance at section 3.1 here: https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2007/guenther_07_BVHonGPU/Guenter_et_al._-_Realtime_Ray_Tracing_on_GPU_with_BVH-based_Packet_Traversal.pdf. Frankly the best answer to this question would be an extensive discussion of GPU micro-architecture, and I'm not at all qualified to give that. Looking at modern GPU tracing papers like the one above will give you a sense of how deep the performance considerations go.
One last note: any performance advantage of compute over frag in the context of raytracing a complex scene has absolutely nothing to do with rasterization / vertex shader overhead / blending operation overhead, etc. For a complex scene with complex shading, bottlenecks are entirely in the tracing computations, which, as discussed, compute shaders have tools for implementing more efficiently.
I am going to complement Josh Parnell's information.
One problem with both the fragment shader and the compute shader is that they lack recursion.
A ray tracer is recursive by nature (yes, I know it is always possible to transform a recursive algorithm into a non-recursive one, but it is not always easy to do).
So another way to look at the problem could be the following:
Instead of having "one thread" per pixel, one idea could be to have one thread per path (a path being a part of your ray, between two bounces).
Going that way, you dispatch over your "bunch" of rays instead of over your "pixel grid". Doing so simplifies the potential recursion of the ray tracer, and avoids divergence in complex materials.
More information here:
http://research.nvidia.com/publication/megakernels-considered-harmful-wavefront-path-tracing-gpus
I need to draw hundreds of semi-transparent circles as part of my OpenCL pipeline.
Currently, I'm using OpenGL (with alpha blend), synced (for portability) using clFinish and glFinish with my OpenCL queue.
Would it be faster to do this rendering task in OpenCL? (Assuming the rest of the pipeline is already in OpenCL, and may run on the CPU if no OpenCL-compatible GPU is available.)
It's easy to replace the rasterizer with a simple test function in the case of a circle. The blend function requires a single read from the destination texture per fragment. So a naive OpenCL implementation seems theoretically faster. But maybe OpenGL can render non-overlapping triangles in parallel (which would be harder to implement in OpenCL)?
Odds are good that OpenCL-based processing would be faster, but only because you don't have to deal with CL/GL interop. The fact that you have to execute a glFinish/clFinish at all is a bottleneck.
This has nothing to do with fixed-function vs. shader hardware. It's all about getting rid of the synchronization.
Now, that doesn't mean that there aren't wrong ways to use OpenCL to render these things.
What you don't want to do is write colors to memory with one compute operation, then read from another compute op, blend, and write them back out to memory. That way lies madness.
What you ought to do instead is effectively build a tile-based renderer internally. Each workgroup will represent some count of pixels (experiment to determine the best count for performance). Each invocation operates on a single pixel. They'll use their pixel position, do the math to determine whether the pixel is within the circle (and how much of it is within the circle), then blend that with a local variable the invocation keeps internally. So each invocation processes all of the circles, only writing their pixel's worth of data out at the very end.
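A sketch of that structure as an OpenCL kernel (the buffer layout and names are assumptions, coverage is reduced to a hard inside/outside test, and colors are premultiplied so the blend is a single multiply-add):

// Hypothetical layout: centers[i] = (x, y), radii[i], colors[i] = premultiplied RGBA.
__kernel void draw_circles(__global const float2* centers,
                           __global const float*  radii,
                           __global const float4* colors,
                           const int num_circles,
                           __global float4* framebuffer,
                           const int width,
                           const int height)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    if (x >= width || y >= height)
        return;

    // One read per pixel for the background (or start from a clear color instead).
    float4 dst = framebuffer[y * width + x];

    // Blend every circle into a private register, back to front.
    for (int i = 0; i < num_circles; ++i) {
        const float2 d = (float2)(x + 0.5f, y + 0.5f) - centers[i];
        if (dot(d, d) <= radii[i] * radii[i]) {
            const float4 src = colors[i];      // premultiplied alpha
            dst = src + dst * (1.0f - src.w);
        }
    }

    // One write per pixel at the very end.
    framebuffer[y * width + x] = dst;
}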
Now if you want to be clever, you can do culling, so that each work group is given only the circles that are guaranteed to affect at least some pixel within their particular area. That is effectively a preprocessing pass, and you could even do that on the CPU, since it's probably not that expensive.
I am confused about what's faster versus what's slower when it comes to coding algorithms that execute in the pipeline.
I made a program with a GS that was seemingly bottlenecked by fill rate, because timer queries showed it executing much faster with rasterisation disabled.
So then I made a different multi-pass algorithm using transform feedback, still using a GS in each pass, that in theory does much less work overall by executing in stages, and significantly reduces the fill rate because it renders far fewer triangles. Yet in my early tests it appears to run slower.
My original thought was that the fill-rate bottleneck was traded for the bottleneck of issuing multiple draw calls. But how expensive is another draw call, really? How much overhead is involved on the CPU and the GPU?
Then I read the answer of a different stack question regarding the GS:
No one has ever accused Geometry Shaders of being fast. Especially when increasing the size of geometry.
Your GS is taking a line and not only doing a 30x amplification of vertex data, but also doing lighting computations on each of those new vertices. That's not going to be terribly fast, in large part due to a lack of parallelism. Each GS invocation has to do 60 lighting computations, rather than having 60 separate vertex shader invocations doing 60 lighting computations in parallel.
You're basically creating a giant bottleneck in your geometry shader.
It would probably be faster to put the lighting stuff in the fragment shader (yes, really).
and it makes me wonder how it's possible for geometry shaders to be slower when using them produces less work overall. I know things execute in parallel, but my understanding is that there is only a relatively small group of shader cores, so starting a number of threads much larger than that group will result in a bottleneck roughly proportional to program complexity (instruction count) times the number of threads (using "thread" here to refer to an invocation of a shader). If you can have some instruction execute once per vertex in the geometry shader instead of once per fragment, why would it ever be slower?
Help me gain a better understanding so I don't waste time designing algorithms that are inefficient.