GPGPU - effective ping-pong technique? - glsl

I'm trying to implement an efficient fluid solver on the GPU using WebGL and GLSL shaders.
I've found an interesting article:
http://http.developer.nvidia.com/GPUGems/gpugems_ch38.html
See: 38.3.2 Slab Operations
I'm wondering whether this technique of enforcing boundary conditions is possible with ping-pong rendering.
If I render only lines, what happens to the interior of the texture?
I've always assumed that the whole input texture must be copied to the temporary texture (with the boundary updated during that copy, of course), because the two are swapped after the operation.
This is especially interesting given that Example 38-5, The Boundary Condition Fragment Program (visualization: http://i.stack.imgur.com/M4Hih.jpg), shows a scheme that, in my opinion, requires the ping-pong technique.
What do you think? Am I misunderstanding something?
In general I've found texture writes to be extremely costly, which is why I would like to limit them somehow. Unfortunately, the ping-pong technique requires a lot of texture writes.

I've actually implemented the technique described in that chapter using framebuffer objects (FBOs) as the render-to-texture method (but in desktop OpenGL, since WebGL didn't exist at the time), so it's definitely possible. Unfortunately I don't believe I have the code any more, but if you tag any future questions you have with [webgl] I'll see if I can provide some help.
You will need to ping-pong several times per frame (the article mentions five steps, but I seem to recall the exact number depends on the quality of the simulation you want and on your exact boundary conditions). Using FBOs is quite a bit more efficient than it was when this was written (the author mentions using a GeForce FX 5950, which was a while ago), so I wouldn't worry about the overhead he mentions in the article. As long as you aren't bringing data back to the CPU, you shouldn't find too high a cost for switching between texture and the framebuffer.
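To make the ping-pong structure concrete, here is a minimal CPU sketch in Python rather than GLSL, so it runs without a GL context. It assumes, as I read section 38.3.2, that each pass draws an interior quad and the boundary lines into the same destination texture, with both reading from the source texture. The function names, the 4-neighbor average standing in for the interior program, and the `scale` parameter are illustrative assumptions, not code from the chapter.

```python
def interior_program(src, x, y):
    """Stand-in for an interior fragment program: average of the 4 neighbors."""
    return 0.25 * (src[y][x - 1] + src[y][x + 1] + src[y - 1][x] + src[y + 1][x])


def boundary_program(src, x, y, dx, dy, scale):
    """Boundary cell = scale * the neighboring cell at offset (dx, dy)."""
    return scale * src[y + dy][x + dx]


def slab_pass(src, dst, scale):
    """One render pass: every cell of dst is written, all reads come from src."""
    h, w = len(src), len(src[0])
    # "Interior quad": covers everything except the one-cell border.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dst[y][x] = interior_program(src, x, y)
    # "Boundary lines": four one-cell-wide lines drawn into the same target.
    for x in range(w):
        dst[0][x] = boundary_program(src, x, 0, 0, 1, scale)
        dst[h - 1][x] = boundary_program(src, x, h - 1, 0, -1, scale)
    for y in range(h):
        dst[y][0] = boundary_program(src, 0, y, 1, 0, scale)
        dst[y][w - 1] = boundary_program(src, w - 1, y, -1, 0, scale)


def simulate(field, passes, scale):
    """Ping-pong: swap the roles of the two buffers after every pass."""
    src = [row[:] for row in field]
    dst = [[None] * len(row) for row in field]
    for _ in range(passes):
        slab_pass(src, dst, scale)
        src, dst = dst, src  # no copy, just swap which buffer is read
    return src
```

Under that assumption, no separate copy of the interior is needed: the interior quad rewrites it on every pass anyway, and the lines only fill in the border of the same target.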
You will have some leakage if your boundaries are only a pixel thick, but that may or may not be acceptable depending on how you render your results and the velocity of your fluid. Making the boundaries thicker may help, and papers written since this one explore different ways of confining the fluid within boundaries. I also recall a few on more efficient diffusion/pressure solvers that you might check out after you have this version working. You'll find some interesting follow-ups if you search Google Scholar for papers that cite the GPU Gems article.
Addendum: I'm not sure I entirely understand your question about boundaries. The key is that you must run a shader at each pixel of what you want to be a boundary, but it doesn't really matter how that pixel gets there, whether it's drawn with lines, points, or triangles (as long as its inputs are correct).
In the very general case (which might not apply if you only have a limited number of boundary primitives), you will likely have to draw a framebuffer-covering quad, since the interactions with the velocity and pressure fields are more complicated (any surrounding pixel could be another boundary pixel, instead of having simply defined edges). See section 38.5.4 (Arbitrary Boundaries) for some explanation of how to do it. If something isn't a boundary, you won't touch the vector field, and if it is, instead of hardcoding which directions you want to look in to sum vector values, you'll probably end up testing the surrounding pixels and only summing the ones that aren't boundaries so that you can enforce the boundary conditions.
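The "test the surrounding pixels and only sum the ones that aren't boundaries" idea can be sketched on the CPU like this. It is only an illustration: the `is_boundary` mask and the function name are assumptions of mine, and what you do with the sum and the count depends on which field (velocity or pressure) you are updating.

```python
def sum_fluid_neighbors(field, is_boundary, x, y):
    """Sum the 4-neighbors of (x, y) that are fluid cells; return (total, count)."""
    h, w = len(field), len(field[0])
    total, count = 0.0, 0
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        inside = 0 <= nx < w and 0 <= ny < h
        # Skip neighbors that are outside the grid or marked as boundary.
        if inside and not is_boundary[ny][nx]:
            total += field[ny][nx]
            count += 1
    return total, count
```

In a shader the mask would typically live in its own texture (or a spare channel) that the fragment program samples alongside the field.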

Related

Ray tracing via Compute Shader vs Screen Quad

I was recently looking for tutorials on ray tracing with OpenGL. Most of them prefer compute shaders. I wonder why they don't just render to a texture and then draw that texture to the screen as a quad.
What are the advantages and disadvantages of the compute shader method over the screen quad?
Short answer: because compute shaders give you more effective tools to perform complex computations.
Long answer:
Perhaps the biggest advantage that they afford (in the case of tracing) is the ability to control exactly how work is executed on the GPU. This is important when you're tracing a complex scene. If your scene is trivial (e.g., Cornell Box), then the difference is negligible. Trace some spheres in your fragment shader all day long. Check http://shadertoy.com/ to witness the madness that can be achieved with modern GPUs and fragment shaders.
But. If your scene and shading are quite complex, you need to control how work is done. Rendering a quad and doing the tracing in a frag shader is going to, at best, make your application hang while the driver cries, changes its legal name, and moves to the other side of the world...and at worst, crash the driver. Many drivers will abort if a single operation takes too long (which virtually never happens under standard usage, but will happen awfully quickly when you start trying to trace 1M poly scenes).
So you're doing too much work in the frag shader...next logical thought? OK, limit the workload. Draw smaller quads to control how much of the screen you're tracing at once. Or use glScissor. Make the workload smaller and smaller until your driver can handle it.
Guess what we've just re-invented? Compute shader work groups! Work groups are compute shader's mechanism for controlling job size, and they're a far better abstraction for doing so than fragment-level hackery (when we're dealing with this kind of complex task). Now we can very naturally control how many rays we dispatch, and we can do so without being tightly-coupled to screen-space. For a simple tracer, that adds unnecessary complexity. For a 'real' one, it means that we can easily do sub-pixel raycasting on a jittered grid for AA, huge numbers of raycasts per pixel for pathtracing if we so desire, etc.
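As a rough illustration of the two ways of slicing up the work, here is a small Python sketch. The 8x8 local size and the helper names are assumptions for the example; the group counts are what you would pass to glDispatchCompute, and the tiles are the rectangles you would feed to glScissor in the fragment-shader workaround.

```python
def work_group_count(width, height, local_x=8, local_y=8):
    """Number of work groups needed to cover the image (ceiling division)."""
    groups_x = (width + local_x - 1) // local_x
    groups_y = (height + local_y - 1) // local_y
    return groups_x, groups_y


def scissor_tiles(width, height, tile):
    """Yield (x, y, w, h) rectangles that cover the screen tile by tile."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y, min(tile, width - x), min(tile, height - y)
```

With compute, the group count does not have to follow the pixel grid at all; it could just as well be derived from the number of rays you want in flight.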
Other features of compute shaders that are useful for performant, industrial-strength tracers:
Shared memory within a work group (allows, for example, packet tracing, wherein an entire packet of spatially coherent rays is traced at the same time to exploit memory coherence and the ability to communicate with nearby rays)
Scatter Writes allow compute shaders to write to arbitrary image locations (note: image and texture are different in subtle ways, but the advantage remains relevant); you no longer have to trace directly from a known pixel location
In general, the architecture of modern GPUs is designed to support this kind of task more naturally using compute. Personally, I have written a real-time progressive path tracer using MLT, kd-tree acceleration, and a number of other computationally expensive techniques (PT is already extremely expensive). I tried to remain in a fragment shader / full-screen quad as long as I could. Once my scene was complex enough to require an acceleration structure, my driver started choking no matter what hackery I pulled. I re-implemented in CUDA (not quite the same as compute, but leveraging the same fundamental GPU architectural advances), and all was well with the world.
If you really want to dig in, have a glance at section 3.1 here: https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2007/guenther_07_BVHonGPU/Guenter_et_al._-_Realtime_Ray_Tracing_on_GPU_with_BVH-based_Packet_Traversal.pdf. Frankly the best answer to this question would be an extensive discussion of GPU micro-architecture, and I'm not at all qualified to give that. Looking at modern GPU tracing papers like the one above will give you a sense of how deep the performance considerations go.
One last note: any performance advantage of compute over frag in the context of raytracing a complex scene has absolutely nothing to do with rasterization / vertex shader overhead / blending operation overhead, etc. For a complex scene with complex shading, bottlenecks are entirely in the tracing computations, which, as discussed, compute shaders have tools for implementing more efficiently.
I'd like to add to Josh Parnell's answer.
One problem with both fragment shaders and compute shaders is that neither supports recursion.
A ray tracer is recursive by nature (yes, I know it is always possible to transform a recursive algorithm into a non-recursive one, but it is not always easy to do).
So another way to look at the problem is the following:
Instead of having one thread per pixel, one idea is to have one thread per path (a path here being the part of your ray between two bounces).
That way, you dispatch over your bunch of rays instead of over your pixel grid. Doing so simplifies the potential recursion of the ray tracer and avoids divergence in complex materials.
More information here:
http://research.nvidia.com/publication/megakernels-considered-harmful-wavefront-path-tracing-gpus
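A toy Python sketch of the idea, not the algorithm from the paper: recursion is replaced by a loop over bounces, and each iteration processes only the paths that are still alive. The "scene" is an assumption made up for the example (every hit emits `emitted` and reflects `albedo`, and the path of pixel p ends after p + 1 segments).

```python
def trace_wavefront(num_pixels, max_bounces, albedo=0.5, emitted=1.0):
    """Iterative tracing: one work item per live path segment, per bounce."""
    radiance = [0.0] * num_pixels
    # Each entry is (pixel index, throughput carried along the path so far).
    active = [(pixel, 1.0) for pixel in range(num_pixels)]
    for bounce in range(max_bounces):
        next_active = []
        for pixel, throughput in active:
            radiance[pixel] += throughput * emitted
            # Toy termination rule: pixel p's path ends after p + 1 segments.
            if bounce < pixel:
                next_active.append((pixel, throughput * albedo))
        active = next_active  # compacted list: only surviving paths remain
    return radiance
```

On a GPU each bounce would be one dispatch sized to the number of surviving paths, so terminated paths stop occupying threads.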

texture(...) via textureoffset(...) performance in glsl

Does utilizing textureOffset(...) increase performance compared to calculating offsets manually and using regular texture(...) function?
As there is a GL_MAX_PROGRAM_TEXEL_OFFSET property, I would guess that it can fetch offset texels in a single fetch, or at least in as few fetches as possible, making it superb for blurring effects, for example. But I can't find anywhere how it works internally.
Update:
Reformulating question: is it common among gl-drivers to make any optimizations regarding texture fetches when utilizing the textureOffset(...) function?
You're asking the wrong question. The question should not be whether the more specific function will always have better performance. The question is whether the more specific function will ever be slower.
And there's no reason to expect it to be slower. If the hardware has no specialized functionality for offset texture accesses, then the compiler will just offset the texture coordinate manually, exactly like you could. If there is hardware to help, then it will use it.
So if you need textureOffset and can live within its limitations, there's no reason not to use it.
"I would guess that it can fetch offset texels in a single fetch, or at least in as few fetches as possible, making it superb for blurring effects"
No, that's textureGather. textureOffset does exactly what its name says: it accesses a texture based on a texture coordinate, with a texel offset from that coordinate's location.
textureGather fetches the four neighboring texels of a 2x2 footprint in a single call (one component from each). If you need to read a region of a texture for blurring, textureGather (and textureGatherOffset) will be more useful than textureOffset.
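To illustrate the difference, here is a small CPU emulation in Python on a single-channel texture stored as a list of rows. It works on integer texel coordinates and uses clamp-to-edge addressing, both of which are simplifying assumptions (real samplers take normalized coordinates and follow the sampler's wrap mode); the function names are made up for the example.

```python
def texel(tex, i, j):
    """Nearest texel fetch with clamp-to-edge addressing."""
    h, w = len(tex), len(tex[0])
    return tex[min(max(j, 0), h - 1)][min(max(i, 0), w - 1)]


def texel_offset(tex, i, j, di, dj):
    """Like textureOffset: one fetch, shifted by a whole-texel offset."""
    return texel(tex, i + di, j + dj)


def gather(tex, i0, j0):
    """Like textureGather: four texels of a 2x2 footprint in one call.

    Component order follows GLSL: (i0, j1), (i1, j1), (i1, j0), (i0, j0).
    """
    i1, j1 = i0 + 1, j0 + 1
    return (texel(tex, i0, j1), texel(tex, i1, j1),
            texel(tex, i1, j0), texel(tex, i0, j0))
```

So one gather call returns what would otherwise take four offset fetches, which is where the saving for blur-style kernels comes from.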

OpenGL: Are degenerate triangles in a Triangle Strip acceptable outside of OpenGL-ES?

In this tutorial for OpenGL ES, techniques for optimizing models are explained and one of those is to use triangle strips to define your mesh, using "degenerate" triangles to end one strip and begin another without ending the primitive. http://www.learnopengles.com/tag/degenerate-triangles/
However, this guide is very specific to mobile platforms, and I wanted to know whether this technique holds for modern desktop hardware. Specifically, would it hurt? Would it cause graphical artifacts or degrade performance (as opposed to splitting the strips into separate primitives)?
If it causes no artifacts and performs at least as well, I aim to use it solely because it makes organizing vertices in a certain mesh I want to draw easier.
Degenerate triangles work pretty well on all platforms. I'm aware of an old fixed-function console that struggled with degenerate triangles, but anything vaguely modern will be fine. Reducing the number of draw calls is always good and I would certainly use degenerates rather than multiple calls to glDrawArrays.
However, an alternative that usually performs better is indexed draws of triangle lists. With a triangle list you have a lot of flexibility to reorder the triangles to take maximum advantage of the post-transform cache. The post-transform cache is a hardware cache of the last few vertices that went through the vertex shader; the GPU can spot that you've re-issued the same vertex and skip the entire vertex shader for that vertex.
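A rough way to see the effect is to simulate the cache on the CPU. The sketch below assumes a simple FIFO cache of 16 entries; real hardware differs in both size and replacement policy, so treat the numbers as illustrative only.

```python
from collections import deque


def vertex_shader_runs(indices, cache_size=16):
    """Count vertex shader invocations for an index list with a FIFO cache."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out when full
    runs = 0
    for index in indices:
        if index not in cache:
            runs += 1            # cache miss: the vertex shader has to run
            cache.append(index)
    return runs
```

For a quad drawn as two indexed triangles, `[0, 1, 2, 2, 1, 3]`, this counts 4 invocations instead of the 6 a non-indexed list would need.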
In addition to the above answers (no it shouldn't hurt at all unless you do something mad in terms of the ratio of real triangles to the degenerates), also note that the newer versions of OpenGL and OpenGL ES (3.x or higher) APIs support a means to insert breaks into index lists without needing an actual degenerate triangle, which is called primitive restart.
https://www.khronos.org/opengles/sdk/docs/man3/html/glEnable.xhtml
When it is enabled, you write the maximum value representable by the index type (for example 0xFFFF for 16-bit indices) into the index list, and when the GPU encounters that value it starts building a new triangle strip from the next index value.
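Here is a small Python sketch comparing the two ways of joining strips in one index list. The helper names are made up for the example, 16-bit indices are assumed for the restart value, and the decoder is only there to show that both encodings produce the same visible triangles.

```python
RESTART = 0xFFFF  # assumed restart value: the maximum 16-bit index


def join_with_degenerates(strips):
    """Join strips by repeating indices so the connecting triangles have zero area."""
    out = []
    for strip in strips:
        if out:
            out += [out[-1], strip[0]]   # repeat last and first index
            if len(out) % 2 == 1:        # pad so the next strip keeps its winding
                out.append(strip[0])
        out += strip
    return out


def join_with_restart(strips):
    """Join strips by inserting the restart index between them."""
    out = []
    for strip in strips:
        if out:
            out.append(RESTART)
        out += strip
    return out


def strip_triangles(indices, restart=None):
    """Decode a strip into its non-degenerate triangles with consistent winding."""
    triangles = []
    run = []  # indices seen since the last restart
    for index in indices:
        if restart is not None and index == restart:
            run = []
            continue
        run.append(index)
        if len(run) >= 3:
            a, b, c = run[-3:]
            if len({a, b, c}) == 3:  # skip zero-area (degenerate) triangles
                # Every second triangle of a strip has flipped winding.
                triangles.append((a, b, c) if len(run) % 2 == 1 else (b, a, c))
    return triangles
```

The degenerate version costs a few extra indices per join, while the restart version costs one.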
It will not cause artifacts. As to "degrading performance"... relative to what? Relative to a random assortment of triangles with no indexing? Yes, it will be faster than that.
But there are plenty of other things one can do. For example, primitive restarting, which removes the need for degenerate triangles. Then there's using ordered lists of triangles for improved cache coherency. Will triangle strips be faster than that?
It rather depends on what you're rendering, how expensive your vertex shaders are, and various other things.
But at the end of the day, if you care about maximum performance on particular platforms, then you should profile for each platform and pick the vertex data based on what platform you're running on. If performance is really that important to you, then you're going to have to put forth some effort.

glUniform vs. single draw call performance

Suppose I have many meshes I'd like to render. I have two choices:
Bake transforms and colors for each mesh into a VBO and render with a single draw call.
Use glUniform for transforms and colors and use many draw calls (but still a single VBO)
Assuming the scene changes very little between frames, which method tends to be better?
There are more than those two choices. At least one more comes to mind:
...
....
Use attributes for transforms and colors and use many draw calls.
Choice 3 is similar to choice 2, but setting attributes (using calls like glVertexAttrib4f) is mostly faster than setting uniforms. The efficiency of setting uniforms is highly platform dependent. But they're generally not intended to be modified very frequently. They are called uniform for a reason. :)
That being said, choice 1 might be the best for your use case where the transforms/colors change rarely. If you're not doing this yet, you could try keeping the attributes that are modified in a separate VBO (with usage GL_DYNAMIC_DRAW), and the attributes that remain constant in their own VBO (with usage GL_STATIC_DRAW). Then make the necessary updates to the dynamic buffer with glBufferSubData.
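As a sketch of the bookkeeping involved, the Python below computes the byte range you would pass to glBufferSubData to update one mesh's colors in the dynamic buffer. The layout (one RGBA float color per vertex, meshes stored back to back with the same vertex count) and the helper names are assumptions for the example, and a bytearray stands in for the GL buffer.

```python
import struct

COLOR_BYTES = 4 * 4  # assumed layout: RGBA as four 32-bit floats per vertex


def color_update_range(mesh_index, verts_per_mesh):
    """Return (offset, size) in bytes of one mesh's colors in the dynamic buffer."""
    size = verts_per_mesh * COLOR_BYTES
    return mesh_index * size, size


def apply_sub_data(buffer, offset, data):
    """Stand-in for glBufferSubData: overwrite a byte range in place."""
    buffer[offset:offset + len(data)] = data


def set_mesh_color(buffer, mesh_index, verts_per_mesh, rgba):
    """Write the same color to every vertex of one mesh."""
    offset, size = color_update_range(mesh_index, verts_per_mesh)
    data = struct.pack("4f", *rgba) * verts_per_mesh
    assert len(data) == size
    apply_sub_data(buffer, offset, data)
```

Only the touched range is uploaded; the static buffer with the positions is never re-sent.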
The reality is that there are no simple rules to predict what is going to perform best. It will depend on the size of your data and draw calls, how frequent and large the data changes are, and also very much on the platform you run on. If you want to be confident that you're using the most efficient solution, you need to implement all of them, and start benchmarking.
Generally, option 1 (minimize number of draw calls) is the best advice. There are a couple of caveats:
I have seen performance fall off a cliff when using very large VBOs on at least one mobile device (I assume this is relevant given the opengl-es tag). The explanation (from the vendor) involved internal buffers exceeding a certain size.
If putting all the information which would otherwise be conveyed with uniforms into vertex attributes significantly increases the size of the vertex buffer, the price you pay (in perhaps costly memory reads) of reading redundant information (because it doesn't really vary per vertex) might negate the savings of using fewer draw calls.
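A quick back-of-the-envelope calculation shows how fast the buffer can grow. The sizes below are assumptions for illustration (a 12-byte position per vertex, and 80 bytes of former uniform data per mesh, e.g. a 4x4 float matrix plus an RGBA float color, repeated on every vertex).

```python
def baked_buffer_bytes(num_meshes, verts_per_mesh,
                       position_bytes=12, per_mesh_bytes=80):
    """Buffer size when the per-mesh data is duplicated on every vertex."""
    return num_meshes * verts_per_mesh * (position_bytes + per_mesh_bytes)


def plain_buffer_bytes(num_meshes, verts_per_mesh, position_bytes=12):
    """Buffer size when the per-mesh data stays in uniforms."""
    return num_meshes * verts_per_mesh * position_bytes
```

With 100 meshes of 1000 vertices each, that is 9,200,000 bytes baked versus 1,200,000 bytes with uniforms, so every vertex fetch reads several times more data in the baked layout.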
As always the best (but tiresome) advice is to test (I know this is particularly hard developing for mobile where there are many potential implementations your code could be running on). Try to keep your pipeline/toolchain flexible enough that you can easily try out and compare different options.

Which is a larger performance drain: quantity of vertices in one draw call, or quantity of calls?

I am quickly finding that one of the organisational considerations you must make when preparing rendering in OpenGL is the type of topology and the arrangement of vertices.
Now there are some interesting methods out there for organising vertices into very long arrays, with nice uses of interleaved arrays, indexes, etc, so that you can pour a lot of geometry into one OpenGL call.
But it's much easier in some cases to simply iterate and perform multiple calls with smaller vertex arrays.
While I agree with the notion that premature optimization is somewhat wasteful, just how important of a consideration should it be to minimize OpenGL calls, especially if multiple calls would actually involve far fewer vertexes per call?
I can see that this is one of those decisions that is important early in the development process, since it forms a lot of the structure of how vertexes get created and organized.
There is an overhead for each command you send down to the GPU. Batching the vertices minimizes that overhead and also allows the driver to make small optimizations to your data before sending it to the hardware. It can make quite a difference, and it is the reason glBegin and glEnd were removed from newer versions of OpenGL.
You should try to avoid making many driver state changes and many draw calls.
EDIT: Consider using degenerate vertices in your triangle strips (this also helps minimize the number of vertices processed) so that you can use a single draw call to render all your topology (unless you need to change some driver state between parts of the topology).
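One simple way to cut both draw calls and state changes is to group geometry by the state it needs before submitting it. The Python sketch below is only an illustration; the `(state, vertices)` item format and the function name are assumptions, and `state` stands for whatever forces a state change in your renderer (shader, texture, blend mode).

```python
def build_batches(draw_items):
    """Merge draw items that share a state key into one vertex list per state."""
    batches = {}
    for state, vertices in draw_items:
        batches.setdefault(state, []).extend(vertices)
    # One draw call per state, in a stable order.
    return sorted(batches.items())
```

For three items using states A, B, A this yields two batches, so two draw calls and at most two state changes instead of three of each.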
You can find a balance for your specific needs, but there are many variables in the equation, and there's no simple solution (like "always make the scene one big single batch!"). TraxNet gave you good advice though: always try to minimize API calls (whether draw calls or state changes). It doesn't have to be just a few calls, however. On a modern PC it could be thousands per frame; on a not-so-modern mobile phone, maybe just a few hundred.
TraxNet also mentioned degenerate triangles (which help form strips). Although they are still triangles (and so add to the total triangle count rendered), they cost almost nothing while helping to minimize the number of draw calls.