For simple rendering: Is OpenCL faster than OpenGL?

I need to draw hundreds of semi-transparent circles as part of my OpenCL pipeline.
Currently, I'm using OpenGL (with alpha blend), synced (for portability) using clFinish and glFinish with my OpenCL queue.
Would it be faster to do this rendering task in OpenCL? (assuming the rest of the pipeline is already in OpenCL, and may run on the CPU if no OpenCL-compatible GPU is available).
It's easy to replace the rasterizer with a simple test function in the case of a circle. The blend function requires a single read from the destination texture per fragment. So a naive OpenCL implementation seems to be theoretically faster. But maybe OpenGL can render non-overlapping triangles in parallel (this would be harder to implement in OpenCL)?

Odds are good that OpenCL-based processing would be faster, but only because you don't have to deal with CL/GL interop. The fact that you have to execute a glFinish/clFinish at all is a bottleneck.
This has nothing to do with fixed-function vs. shader hardware. It's all about getting rid of the synchronization.
Now, that doesn't mean that there aren't wrong ways to use OpenCL to render these things.
What you don't want to do is write colors to memory with one compute operation, then read from another compute op, blend, and write them back out to memory. That way lies madness.
What you ought to do instead is effectively build a tile-based renderer internally. Each work group will represent some count of pixels (experiment to determine the best count for performance). Each invocation operates on a single pixel: it uses its pixel position, does the math to determine whether the pixel is within the circle (and how much of it is within the circle), then blends the result into a local variable that the invocation keeps internally. So each invocation processes all of the circles, writing only its own pixel's worth of data out at the very end.
Now if you want to be clever, you can do culling, so that each work group is given only the circles that are guaranteed to affect at least some pixel within their particular area. That is effectively a preprocessing pass, and you could even do that on the CPU, since it's probably not that expensive.
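As a rough illustration of the scheme above, here is a CPU-side reference sketch in Python. It is not OpenCL code: each (x, y) iteration stands in for one invocation, and each tile stands in for one work group. The names (Circle, cull_circles, shade_pixel, render), the tile size of 8, and the binary inside/outside test (no partial coverage) are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical circle record: centre, radius, RGB colour in [0, 1], opacity.
Circle = namedtuple("Circle", "cx cy r color alpha")

def cull_circles(circles, x0, y0, tile):
    """Keep (in order) the circles whose bounding box overlaps the tile."""
    return [c for c in circles
            if c.cx + c.r >= x0 and c.cx - c.r < x0 + tile
            and c.cy + c.r >= y0 and c.cy - c.r < y0 + tile]

def shade_pixel(x, y, circles, background):
    """One 'invocation': blend every covering circle into a local accumulator."""
    px, py = x + 0.5, y + 0.5              # sample at the pixel centre
    acc = list(background)
    for c in circles:
        dx, dy = px - c.cx, py - c.cy
        if dx * dx + dy * dy <= c.r * c.r:
            # Standard "source over" alpha blend against the local value.
            acc = [c.alpha * s + (1.0 - c.alpha) * d
                   for s, d in zip(c.color, acc)]
    return tuple(acc)

def render(width, height, circles, tile=8, background=(0.0, 0.0, 0.0)):
    image = [[background] * width for _ in range(height)]
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Per-tile list, as the culling pre-pass would produce it.
            local = cull_circles(circles, tx, ty, tile)
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    # The only write to the image for this pixel.
                    image[y][x] = shade_pixel(x, y, local, background)
    return image
```

Because the culling is conservative and preserves circle order, the result should not depend on the tile size; only the amount of work per pixel changes.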

What is GPU driven rendering? [closed]

Nowadays I'm hearing from different places about so-called GPU driven rendering, a new rendering paradigm which doesn't require draw calls at all and which is supported by the new versions of the OpenGL and Vulkan APIs. Can someone explain how it actually works on a conceptual level, and what the main differences from the traditional approach are?
Overview
In order to render a scene, a number of things have to happen. You need to walk your scene graph to figure out which objects exist. For each object which exists, you now need to determine if it is visible. For each object which is visible, you need to figure out where its geometry is stored, which textures and buffers will be used to render that object, which shaders to use to render the object, and so forth. Then you render that object.
The "traditional" method is for the CPU to handle this process. The scene graph lives in CPU-accessible memory. The CPU does visibility culling on that scene graph. The CPU takes the visible objects and accesses some CPU data about the geometry (OpenGL buffer object and texture names, Vulkan descriptor sets and VkBuffers, etc), shaders, etc, transferring this as state data to the GPU. Then the CPU issues a GPU command to render that object with that state.
Now, if we go back farther, the most "traditional" method doesn't involve a GPU at all. The CPU would just take this mesh and texture data, do vertex transformations, rasterization, and so forth, producing an image in CPU memory. However, we started off-loading some of this to a separate processor. We started with the rasterization stuff (the earliest graphics chips were just rasterizers; the CPU did all the vertex T&L). Then we incorporated the vertex transformations into the GPU. When we did that, we started having to store vertex data in GPU accessible memory so the GPU could read it on its own time.
We did all of that, off-loading these things to a separate processor for two reasons: the GPU was (much) faster at it, and the CPU can now spend its time doing something else.
GPU driven rendering is just the next stage in that process. We went from no GPU, to rasterization GPU, to vertex GPU, and now to scene-graph-level GPU. The "traditional" method offloads how to render to the GPU; GPU driven rendering offloads the decision of what to render.
Mechanism
Now, the reason we haven't been doing this all along is because the basic rendering commands all take data that comes from the CPU. glDrawArrays/Elements takes a number of parameters from the CPU. So even if we used the GPU to generate that data, we would need a full GPU/CPU synchronization so that the CPU could read the data... and give it right back to the GPU.
That's not helpful.
OpenGL 4 gave us indirect rendering of various forms. The basic idea is that, instead of taking those parameters from a function call, they're just data stored in GPU memory. The CPU still has to make a function call to start the rendering operation, but the actual parameters to that call are just data stored in GPU memory.
The other half of that requires the ability of the GPU to write data to GPU memory in a format that indirect rendering can read. Historically, data on GPUs goes in one direction: data gets read for the purpose of being converted into pixels in a render target. We need a way to generate semi-arbitrary data from other arbitrary data, all on the GPU.
The older mechanism for this was to (ab)use transform feedback for this purpose, but nowadays we just use SSBOs or failing that, image load/store. Compute shaders help here as well, since they are designed to be outside of the standard rendering pipeline and therefore are not bound to its limitations.
The ideal form of GPU-driven rendering makes the scene-graph part of the rendering operation. There are lesser forms, such as having the GPU do nothing more than per-object viewport culling. But let's look at the most ideal process. From the perspective of the CPU, this looks like:
1. Update the scene graph in GPU memory.
2. Issue one or more compute shaders that generate multi-draw indirect rendering commands.
3. Issue a single multi-draw indirect call that draws everything.
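To make the middle step more concrete, here is a small Python sketch that plays the role of the compute pass on the CPU: it culls objects and writes one indirect command per visible object into a byte buffer. The five-field layout matches OpenGL's DrawElementsIndirectCommand (count, instanceCount, firstIndex, baseVertex, baseInstance); everything else (the object dictionaries, their key names, bounding spheres, and the inward-facing frustum planes) is assumed purely for illustration.

```python
import struct

# DrawElementsIndirectCommand: four unsigned 32-bit fields plus a signed
# baseVertex, tightly packed into 20 bytes.
DRAW_CMD = struct.Struct("<IIIiI")

def sphere_in_frustum(center, radius, planes):
    """planes: (nx, ny, nz, d) with normals pointing into the frustum.
    The sphere is rejected only if it lies fully behind some plane."""
    return all(nx * center[0] + ny * center[1] + nz * center[2] + d >= -radius
               for nx, ny, nz, d in planes)

def build_indirect_buffer(objects, planes):
    """Stand-in for the compute shader: emit one command per visible object."""
    buf = bytearray()
    for obj in objects:
        if sphere_in_frustum(obj["center"], obj["radius"], planes):
            buf += DRAW_CMD.pack(obj["index_count"],  # count
                                 1,                   # instanceCount
                                 obj["first_index"],  # firstIndex
                                 obj["base_vertex"],  # baseVertex
                                 0)                   # baseInstance
    return bytes(buf), len(buf) // DRAW_CMD.size
```

On a real GPU the buffer would live in GPU memory and the returned draw count would either be read from a GPU-side counter or be replaced by a fixed maximum; the CPU would then issue a single call such as glMultiDrawElementsIndirect over it.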
Now of course, there's no such thing as a free lunch. Doing full scene graph processing on the GPU requires building your scene graph in a way that is efficient for GPU processing. Even more importantly, visibility culling mechanisms have to be engineered with efficient GPU processing in mind. That's complexity I'm not going to address here.
Implementation
Instead, let's look at the nuts-and-bolts of making the drawing part work. We have to sort out a lot of things here.
See, the indirect rendering command is still a regular old rendering command. While the multi-draw form draws multiple distinct "objects", it's still one CPU rendering command. This means that, for the duration of this command, all rendering state is fixed.
So everything under the purview of this multi-draw operation must use the same shader, bound buffers and textures, blending parameters, stencil state, and so forth. This makes implementing a GPU-driven rendering operation a bit complicated.
State and Shaders
If you need blending, or similar state-based differences in rendering operations, then you are going to have to issue another rendering command. So in the blending case, your scene-graph processing is going to have to compute multiple sets of rendering commands, with each set being for a specific set of blending modes. You may also need to have this system sort transparent objects (unless you're rendering them with an OIT mechanism). So instead of having just one rendering command, you have a small number of them.
But the point of this exercise isn't to have only one rendering command; the point is that the number of CPU rendering commands does not change with regard to how much stuff you're rendering. It shouldn't matter how many objects are in the scene; the CPU will be issuing the same number of rendering commands.
When it comes to shaders, this technique requires some degree of "ubershader" style: where you have a very small number of rather flexible shaders. You want to parameterize your shader rather than having dozens or hundreds of them.
However, things were probably going to fall out that way anyway, particularly with regard to deferred rendering. The geometry pass of deferred renderers tends to use the same kind of processing, since they're just doing vertex transformation and extracting material parameters. The biggest difference usually is with regard to doing skinned vs. non-skinned rendering, but that's really only 2 shader variations, which you can handle similarly to the blending case.
Speaking of deferred rendering, the GPU driven processes can also walk the graph of lights, thus generating the draw calls and rendering data for the lighting passes. So while the lighting pass will need a separate draw call, it will still only need a single multidraw call regardless of the number of lights.
Buffers
Here's where things start to get interesting. See, if the GPU is processing the scene graph, that means that the GPU needs to somehow associate a particular draw within the multi-draw command with the resources that particular draw needs. It may also need to put the data into those resources, like the matrix transforms for a given object and so forth.
Oh, and you also somehow need to tie the vertex input data to a particular sub-draw.
That last part is probably the most complicated. The buffers which OpenGL/Vulkan's standard vertex input method pull from are state data; they cannot change between sub-draws of a multi-draw operation.
Your best bet is to try to put every object's data in the same buffer object, using the same vertex format. Essentially, you have one gigantic array of vertex data. You can then use the drawing parameters for the sub-draw to select which parts of the buffer(s) to use.
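A minimal Python sketch of that "one gigantic array" idea, assuming indexed meshes that all share one vertex format. The helper names (pack_meshes, fetch_vertices) and dictionary keys are hypothetical; first_index and base_vertex correspond to the firstIndex and baseVertex drawing parameters of an indexed sub-draw.

```python
def pack_meshes(meshes):
    """meshes: list of (vertices, indices), with indices local to each mesh.
    Returns the shared vertex list, the shared index list, and the drawing
    parameters that select each mesh's slice of those arrays."""
    all_vertices, all_indices, draws = [], [], []
    for vertices, indices in meshes:
        draws.append({"count": len(indices),
                      "first_index": len(all_indices),
                      "base_vertex": len(all_vertices)})
        all_vertices.extend(vertices)
        # Indices stay mesh-local; base_vertex is added at fetch time.
        all_indices.extend(indices)
    return all_vertices, all_indices, draws

def fetch_vertices(all_vertices, all_indices, draw):
    """What vertex fetch does for one sub-draw of the multi-draw."""
    start = draw["first_index"]
    return [all_vertices[all_indices[i] + draw["base_vertex"]]
            for i in range(start, start + draw["count"])]
```

Keeping the indices mesh-local means a mesh can be moved within the big vertex array by changing only its base_vertex, without rewriting its index data.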
But what do we do about per-object data (matrices, etc), things you would typically use a UBO or global uniform for? How do you effectively change the buffer binding state within a CPU rendering command?
Well... you can't. So you cheat.
First, you realize that SSBOs can be arbitrarily large. So you don't really need to change buffer binding state. What you need is a single SSBO that contains everyone's per-object data. For each vertex, the VS simply needs to pick out the correct data for that sub-draw from the giant list of data.
This is done via a special vertex shader input: gl_DrawID. When you issue a multi-draw command, the VS gets an input value that represents the index of this sub-draw operation within the multidraw command. So you can use gl_DrawID to index into a table of per-object data to fetch the appropriate data for that particular object.
This also means that the compute shader which generates this sub-draw also needs to use the index of that sub-draw to define where in the array to put the per-object data for that sub-draw. So the CS that writes a sub-draw also needs to be responsible for setting up the per-object data that matches the sub-draw.
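The pairing between a sub-draw's index and its per-object data can be sketched in Python as follows. Here emit_visible stands in for the compute shader and vertex_shader for the VS, with draw_id playing the role of gl_DrawID. The function names, dictionary keys, and the row-major 4x4 matrix layout are assumptions for the example; on a GPU the slot would more likely come from an atomic counter than from a list length.

```python
def emit_visible(objects, is_visible):
    """Stand-in for the compute pass: the slot a sub-draw is written to is
    also the slot its per-object data is written to."""
    commands, per_object = [], []
    for obj in objects:
        if not is_visible(obj):
            continue
        # Both lists grow together, so index i in one matches index i
        # in the other.
        commands.append({"count": obj["index_count"],
                         "first_index": obj["first_index"]})
        per_object.append({"model_matrix": obj["model_matrix"]})
    return commands, per_object

def vertex_shader(position, draw_id, per_object):
    """Stand-in for the VS: pick this sub-draw's matrix from the big table
    and transform the vertex with it."""
    m = per_object[draw_id]["model_matrix"]   # 4x4, row-major
    v = (position[0], position[1], position[2], 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))
```

Note that after culling, draw_id is no longer the object's index in the scene; it is only meaningful as an index into the tables written by the same pass.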
Textures
OpenGL and Vulkan have pretty strict limits on the number of textures that can be bound. Well actually those limits are quite large relative to traditional rendering, but in GPU driven rendering land, we need a single CPU rendering call to potentially access any texture. That's harder.
Now, we do have gl_DrawID; coupled with the table mentioned above, we can retrieve per-object data. So: how do we convert this to a texture?
There are multiple ways. We could put a bunch of our 2D textures into an array texture. We can then use gl_DrawID to fetch an array index from our SSBO of per-object data; that array index becomes the array layer we use to fetch "our" texture. Note that we don't use gl_DrawID directly because multiple different sub-draws could use the same texture, and because the GPU code that sets up the array of draw calls does not control the order in which textures appear in our array.
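The indirection described here (draw index, then per-object record, then array layer) can be illustrated with a tiny Python sketch. The function names and the texture_layer key are hypothetical, and the layer assignment is assumed to happen once on the CPU when textures are loaded.

```python
def build_texture_table(texture_names):
    """Assign each distinct texture one array layer, in first-seen order."""
    layers = {}
    for name in texture_names:
        layers.setdefault(name, len(layers))
    return layers

def texture_layer_for_draw(draw_id, per_object):
    """Shader-side lookup: draw index -> per-object record -> array layer."""
    return per_object[draw_id]["texture_layer"]
```

Two sub-draws that use the same texture simply store the same layer number in their per-object records, which is why the draw index itself cannot serve as the layer.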
Array textures have obvious downsides, the most notable of which is that we must respect the limitations of an array texture. All elements in the array must use the same image format. They must all be of the same size. Also, there are limits on the number of array layers in an array texture, so you might encounter them.
The alternatives to array textures differ along API lines, though they basically boil down to the same thing: convert a number into a texture.
In OpenGL land, you can employ bindless texturing (for hardware that supports it). This system provides a mechanism that allows one to generate a 64-bit integer handle which represents a particular texture, pass this handle to the GPU (since it is just an integer, use whatever mechanism you want), and then convert this 64-bit handle into a sampler type. So you use gl_DrawID to fetch a 64-bit handle from the per-object data, then convert that into a sampler of the appropriate type and use it.
In Vulkan land, you can employ sampler arrays (for hardware that supports it). Note that these are not array textures; in GLSL, these are sampler types which are arrayed: uniform sampler2D my_2D_textures[6000];. In OpenGL, this would be a compile error because each array element represents a distinct bind point for a texture, and you cannot have 6000 distinct bind points. In Vulkan, an arrayed sampler only represents a single descriptor, no matter how many elements are in that array. Vulkan implementations have limits on how many elements there can be in such arrays, but hardware that supports the feature you need to employ this (shaderSampledImageArrayDynamicIndexing) will typically offer a generous limit.
So your shader uses gl_DrawID to get an index from the per-object data. The index is turned into a sampler by just fetching the value from the sampler array. The only limitation for textures in that arrayed descriptor is that they must all be of the same type and basic data format (floating-point 2D for sampler2D, unsigned integer cubemap for usamplerCube, etc). The specifics of formats, texture sizes, mipmap counts, and the like are all irrelevant.
And if you're concerned about the cost difference of Vulkan's array of samplers compared to OpenGL's bindless, don't be; implementations of bindless are just doing this behind your back anyway.

Performance gain of glColorMask()/glDepthMask() on modern hardware?

In my application I have some shaders which write only to the depth buffer, to use it later for shadowing. I also have some other shaders which render a fullscreen quad whose depth will not affect any subsequent draw calls, so its depth values may be thrown away.
Assuming the application runs on modern hardware (produced within the last 5 years), will I gain any additional performance if I disable color buffer writing (glColorMask(all to GL_FALSE)) for shadow map shaders, and depth buffer writing (with glDepthMask()) for fullscreen quad shaders?
In other words, do these functions really disable some memory operations, or do they just alter some mask bits which are used in fixed bitwise-operation logic in this part of the rendering pipeline?
And the same question about testing. If I know beforehand that all fragments will pass the depth test, will disabling the depth test improve performance?
My FPS measurements don't show any significant difference, but the result may be different on another machine.
Finally, if rendering runs faster with depth/color test/write disabled, how much faster does it run? Wouldn't this performance gain be negated by the overhead of the GL function calls?
Your question is missing a very important thing: you have to do something.
Every fragment has color and depth values. Even if your FS doesn't generate a value, there will still be a value there. Therefore, every fragment produced that is not discarded will write these values, so long as:
1. The color is routed to a color buffer via glDrawBuffers.
2. There is an appropriate color/depth buffer attached to the FBO.
3. The color/depth write mask allows it to be written.
So if you're rendering and you don't want to write one of those colors or to the depth buffer, you've got to do one of these. Changing #1 or #2 is an FBO state change, which is among the most heavyweight operations you can do in OpenGL. Therefore, your choices are to make an FBO change or to change the write mask. The latter will always be the more performance-friendly operation.
Maybe in your case, your application doesn't stress the GPU or CPU enough for such a change to matter. But in general, changing write masks is a better idea than playing with the FBO.
If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
Are you changing other state at the same time, or is that the only state you're interested in?
One good way to look at these kinds of a priori performance questions is to look at Vulkan or D3D12 and see what it would require in that API. Changing any pipeline state there is a big deal. But changing two pieces of state is no bigger of a deal than one.
So if changing the depth test correlates with changing other state (blend modes, shaders, etc), it's probably not going to hurt any more.
At the same time, if you really care enough about performance for this sort of thing to matter, you should do application testing. And that should happen after you implement this, and across all hardware of interest. And your code should be flexible enough to easily switch from one to the other as needed.

Ray tracing via Compute Shader vs Screen Quad

I was recently looking for tutorials on ray tracing with OpenGL. Most tutorials prefer compute shaders. I wonder why they don't just render to a texture, then render the texture to the screen as a quad.
What are the advantages and disadvantages of the compute shader method over a screen quad?
Short answer: because compute shaders give you more effective tools to perform complex computations.
Long answer:
Perhaps the biggest advantage that they afford (in the case of tracing) is the ability to control exactly how work is executed on the GPU. This is important when you're tracing a complex scene. If your scene is trivial (e.g., Cornell Box), then the difference is negligible. Trace some spheres in your fragment shader all day long. Check http://shadertoy.com/ to witness the madness that can be achieved with modern GPUs and fragment shaders.
But. If your scene and shading are quite complex, you need to control how work is done. Rendering a quad and doing the tracing in a frag shader is going to, at best, make your application hang while the driver cries, changes its legal name, and moves to the other side of the world...and at worst, crash the driver. Many drivers will abort if a single operation takes too long (which virtually never happens under standard usage, but will happen awfully quickly when you start trying to trace 1M poly scenes).
So you're doing too much work in the frag shader...next logical thought? Ok, limit the workload. Draw smaller quads to control how much of the screen you're tracing at once. Or use glScissor. Make the workload smaller and smaller until your driver can handle it.
Guess what we've just re-invented? Compute shader work groups! Work groups are compute shader's mechanism for controlling job size, and they're a far better abstraction for doing so than fragment-level hackery (when we're dealing with this kind of complex task). Now we can very naturally control how many rays we dispatch, and we can do so without being tightly-coupled to screen-space. For a simple tracer, that adds unnecessary complexity. For a 'real' one, it means that we can easily do sub-pixel raycasting on a jittered grid for AA, huge numbers of raycasts per pixel for pathtracing if we so desire, etc.
Other features of compute shaders that are useful for performant, industrial-strength tracers:
Shared Memory within a thread group (allows, for example, packet tracing, wherein an entire packet of spatially-coherent rays is traced at the same time to exploit memory coherence & the ability to communicate with nearby rays)
Scatter Writes allow compute shaders to write to arbitrary image locations (note: image and texture are different in subtle ways, but the advantage remains relevant); you no longer have to trace directly from a known pixel location
In general, the architecture of modern GPUs is designed to support this kind of task more naturally using compute. Personally, I have written a real-time progressive path tracer using MLT, kd-tree acceleration, and a number of other computationally-expensive techniques (PT is already extremely expensive). I tried to remain in a fragment shader / full-screen quad as long as I could. Once my scene was complex enough to require an acceleration structure, my driver started choking no matter what hackery I pulled. I re-implemented in CUDA (not quite the same as compute, but leveraging the same fundamental GPU architectural advances), and all was well with the world.
If you really want to dig in, have a glance at section 3.1 here: https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2007/guenther_07_BVHonGPU/Guenter_et_al._-_Realtime_Ray_Tracing_on_GPU_with_BVH-based_Packet_Traversal.pdf. Frankly the best answer to this question would be an extensive discussion of GPU micro-architecture, and I'm not at all qualified to give that. Looking at modern GPU tracing papers like the one above will give you a sense of how deep the performance considerations go.
One last note: any performance advantage of compute over frag in the context of raytracing a complex scene has absolutely nothing to do with rasterization / vertex shader overhead / blending operation overhead, etc. For a complex scene with complex shading, bottlenecks are entirely in the tracing computations, which, as discussed, compute shaders have tools for implementing more efficiently.
I am going to add to Josh Parnell's answer.
One problem with both fragment shaders and compute shaders is that neither supports recursion.
A ray tracer is recursive by nature (yes, I know it is always possible to transform a recursive algorithm into a non-recursive one, but it is not always easy to do).
So another way to look at the problem could be the following:
Instead of having one thread per pixel, one idea is to have one thread per path (a path here being the part of your ray between two bounces).
Going that way, you dispatch over your "bunch" of rays instead of over your pixel grid. Doing so simplifies the potential recursion of the ray tracer and avoids divergence in complex materials.
More information here:
http://research.nvidia.com/publication/megakernels-considered-harmful-wavefront-path-tracing-gpus
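To illustrate the idea of dispatching over live paths rather than over pixels, here is a toy Python sketch. It is not the algorithm from the linked paper, only the basic loop structure: recursion is replaced by a queue of path states that is rebuilt after every bounce. The bounce callback, its return convention, and the scalar radiance model are all made up for the example.

```python
def trace_wavefront(num_pixels, bounce, max_depth=4):
    """bounce(pixel, depth) returns (emitted, reflectance) for a hit, or
    None when the ray escapes the scene. Both values are scalars here."""
    radiance = [0.0] * num_pixels
    # One entry per live path: (pixel it contributes to, throughput so far).
    queue = [(p, 1.0) for p in range(num_pixels)]
    for depth in range(max_depth):
        next_queue = []
        for pixel, throughput in queue:      # one GPU thread per entry
            hit = bounce(pixel, depth)
            if hit is None:
                continue                     # path terminated
            emitted, reflectance = hit
            radiance[pixel] += throughput * emitted
            if reflectance > 0.0:
                next_queue.append((pixel, throughput * reflectance))
        queue = next_queue                   # only live paths are dispatched again
        if not queue:
            break
    return radiance
```

Each pass over the queue corresponds to one dispatch, so terminated paths stop occupying threads instead of idling while their neighbours keep bouncing.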

OpenGL: Are degenerate triangles in a Triangle Strip acceptable outside of OpenGL-ES?

In this tutorial for OpenGL ES, techniques for optimizing models are explained and one of those is to use triangle strips to define your mesh, using "degenerate" triangles to end one strip and begin another without ending the primitive. http://www.learnopengles.com/tag/degenerate-triangles/
However, this guide is very specific to mobile platforms, and I wanted to know if this technique held for modern desktop hardware. Specifically, would it hurt? Would it either cause graphical artifacts or degrade performance (as opposed to splitting the strips into separate primitives)?
If it causes no artifacts and performs at least as well, I aim to use it solely because it makes organizing vertices in a certain mesh I want to draw easier.
Degenerate triangles work pretty well on all platforms. I'm aware of an old fixed-function console that struggled with degenerate triangles, but anything vaguely modern will be fine. Reducing the number of draw calls is always good and I would certainly use degenerates rather than multiple calls to glDrawArrays.
However, an alternative that usually performs better is indexed draws of triangle lists. With a triangle list you have a lot of flexibility to reorder the triangles to take maximum advantage of the post-transform cache. The post-transform cache is a hardware cache of the last few vertices that went through the vertex shader; the GPU can spot if you've re-issued the same vertex and skip the entire vertex shader for that vertex.
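A quick way to get a feel for this is to simulate the cache. The Python sketch below counts vertex shader invocations for an index stream under a simple FIFO model; the FIFO policy and the default size of 16 entries are assumptions for illustration, since real hardware differs between vendors and generations.

```python
from collections import deque

def vertex_shader_runs(indices, cache_size=16):
    """Count how many times the vertex shader would run for this index
    stream, given a FIFO cache of recently transformed vertices."""
    cache = deque(maxlen=cache_size)   # oldest entry drops out when full
    runs = 0
    for i in indices:
        if i not in cache:
            runs += 1                  # cache miss: transform the vertex
            cache.append(i)
    return runs
```

For a quad drawn as two indexed triangles, [0, 1, 2, 2, 1, 3], this model runs the vertex shader 4 times instead of the 6 a non-indexed draw would need; comparing counts for different triangle orderings of the same mesh shows how much reordering can help.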
In addition to the above answers (no it shouldn't hurt at all unless you do something mad in terms of the ratio of real triangles to the degenerates), also note that the newer versions of OpenGL and OpenGL ES (3.x or higher) APIs support a means to insert breaks into index lists without needing an actual degenerate triangle, which is called primitive restart.
https://www.khronos.org/opengles/sdk/docs/man3/html/glEnable.xhtml
When enabled, you can encode the maximum value of the index type ("MAX_INT", e.g. 0xFFFF for 16-bit indices) in the index list; when the GPU detects it, it restarts building a new tristrip from the next index value.
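The two ways of joining strips can be compared with a small Python sketch that expands a strip's index list into triangles, following the usual convention that every second triangle has its first two vertices swapped to keep a consistent winding. The helper names and the choice of 0xFFFF (16-bit indices) as the restart value are assumptions for the example.

```python
RESTART = 0xFFFF  # assumed restart value for 16-bit indices

def strip_to_triangles(indices, restart=None):
    """Expand a triangle strip into triangles, skipping zero-area ones."""
    tris, run = [], []
    for idx in indices:
        if restart is not None and idx == restart:
            run = []                      # begin a fresh strip
            continue
        run.append(idx)
        if len(run) >= 3:
            a, b, c = run[-3:]
            if len(run) % 2 == 0:         # every second triangle flips
                a, b = b, a
            if a != b and b != c and a != c:
                tris.append((a, b, c))    # degenerates produce nothing
    return tris

def stitch_with_degenerates(s1, s2):
    """Join two strips by repeating the last and first indices."""
    bridge = [s1[-1], s2[0]]
    if len(s1) % 2 == 1:
        bridge.append(s2[0])              # extra index keeps winding parity
    return s1 + bridge + s2
```

For two four-index strips, both the stitched list and the list joined with a restart index expand to the same four triangles; the restart version is simply shorter and needs no parity fix-up.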
It will not cause artifacts. As to "degrading performance"... relative to what? Relative to a random assortment of triangles with no indexing? Yes, it will be faster than that.
But there are plenty of other things one can do. For example, primitive restarting, which removes the need for degenerate triangles. Then there's using ordered lists of triangles for improved cache coherency. Will triangle strips be faster than that?
It rather depends on what you're rendering, how expensive your vertex shaders are, and various other things.
But at the end of the day, if you care about maximum performance on particular platforms, then you should profile for each platform and pick the vertex data based on what platform you're running on. If performance is really that important to you, then you're going to have to put forth some effort.

Efficient Image Pyramid in CUDA?

What's the most efficient way to do image pyramiding in CUDA? I have written my own kernels to do so but imagine we can do better.
Binding to an OpenGL texture using OpenGL interop and using the hardware mipmapping would probably be much faster. Any pointers on how to do this or other
MipMaps are set up when accessed/initialized in OpenGL/DirectX. A CUDA kernel can do the same thing if you allocate a texture 50% wider (or higher) than the initial texture and use the kernel to down-sample the texture and write the result beside the original texture. The kernel will probably work best where each thread evaluates a pixel in the next down-sampled image. It's up to you to determine the sampling scheme and choose appropriate weights for combining the pixels. Try bilinear to start with, then once it's working you can set up trilinear (cubic) or other sampling schemes like anisotropic etc. Simple sampling (linear and cubic) will likely be more efficient since coalesced memory access will occur (refer to the CUDA SDK programming guide). You will probably need to tile the kernel execution since the thread count is limited for parallel invocation (too many pixels, too few threads = use tiling to chunk parallel execution). You might find Mesa3D useful as a reference (it's an open-source implementation of OpenGL).
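As a reference for what such a kernel computes, here is a plain Python sketch of the simplest scheme: a 2x2 box filter where each output pixel (one thread in the CUDA version) averages four input pixels. It assumes a grayscale image stored as a list of rows with power-of-two dimensions, and it keeps the levels in a list rather than packing them beside the original; the function names are hypothetical.

```python
def downsample(image):
    """Halve a 2D grayscale image; each output pixel averages a 2x2 block."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * y][2 * x] + image[2 * y][2 * x + 1] +
              image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)]
            for y in range(h)]

def build_pyramid(image):
    """Repeat the down-sampling until either dimension reaches 1."""
    levels = [image]
    while len(levels[-1]) > 1 and len(levels[-1][0]) > 1:
        levels.append(downsample(levels[-1]))
    return levels
```

A result like this is handy for validating a GPU kernel level by level before moving on to other filter weights.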