Blending in OpenGL compute shaders

Since imageAtomicAdd (which seems to be the only real atomic read-modify-write function that operates on images) is only available for 32-bit integers, I don't see any sensible way to accumulate color values from multiple shader invocations into one pixel.
The only somewhat reasonable approach I can see is to use 32 bits per channel (128 bits per RGBA pixel), add up 8-bit color values, hope the sum doesn't overflow, and clamp to 8 bits afterwards.
This seems wasteful and restrictive (only pure additive blending?).
Accumulating in other data structures doesn't solve the issue either, since shared variables and SSBOs also only seem to support atomicAdd, and likewise only on integers.
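For illustration, the 32-bit-per-channel workaround described above might look something like this as a compute shader; the binding points and the shade() placeholder are invented for the sketch:

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;

// One r32ui image per channel; imageAtomicAdd requires a 32-bit integer format.
layout(binding = 0, r32ui) uniform uimage2D accumR;
layout(binding = 1, r32ui) uniform uimage2D accumG;
layout(binding = 2, r32ui) uniform uimage2D accumB;

// Stand-in for whatever color this invocation actually computes.
vec3 shade(ivec2 pixel) { return vec3(0.25); }

void main() {
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    uvec3 c = uvec3(shade(pixel) * 255.0 + 0.5); // quantize to 8-bit steps

    // Multiple invocations may target the same pixel; the adds are atomic.
    imageAtomicAdd(accumR, pixel, c.r);
    imageAtomicAdd(accumG, pixel, c.g);
    imageAtomicAdd(accumB, pixel, c.b);
    // A separate resolve pass would clamp each channel to 255 and pack to RGBA8.
}
```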
There are two reasons that make me think I am probably missing something:
1. Every path tracer that allows concurrent intersection testing (for shadow rays, for example) has to solve this issue, so it seems like there must be a solution.
2. All kinds of fancy blending can be done in fragment shaders, so the hardware is definitely capable of doing this.
Is everyone just writing path tracers with a 1:1 invocation-to-pixel mapping?

Related

Performance gain of glColorMask()/glDepthMask() on modern hardware?

In my application I have some shaders that write only the depth buffer, to be used later for shadowing. I also have other shaders that render a fullscreen quad whose depth should not affect subsequent draw calls, so its depth values may be thrown away.
Assuming the application runs on modern hardware (produced within the last five years), will I gain any additional performance if I disable color buffer writes (glColorMask with all components GL_FALSE) for the shadow-map shaders, and depth buffer writes (with glDepthMask()) for the fullscreen-quad shaders?
In other words, do these functions really disable some memory operations, or do they just alter mask bits used in fixed bitwise logic in that part of the rendering pipeline?
And the same question about testing: if I know beforehand that all fragments will pass the depth test, will disabling the depth test improve performance?
My FPS measurements don't show any significant difference, but the result may be different on another machine.
Finally, if rendering runs faster with depth/color tests/writes disabled, how much faster does it run? Wouldn't this performance gain be negated by the overhead of the extra GL calls?
Your question is missing a very important point: those writes happen unless you do something to prevent them.
Every fragment has color and depth values. Even if your FS doesn't generate a value, there will still be a value there. Therefore, every fragment produced that is not discarded will write these values, so long as:
1. The color is routed to a color buffer via glDrawBuffers.
2. There is an appropriate color/depth buffer attached to the FBO.
3. The color/depth write mask allows it to be written.
So if you're rendering and you don't want to write one of those colors or the depth value, you have to change one of these. Changing #1 or #2 is an FBO state change, which is among the most heavyweight operations you can do in OpenGL. Your choices are therefore an FBO change or a write-mask change, and the latter will always be the more performance-friendly operation.
Maybe in your case the application doesn't stress the GPU or CPU enough for such a change to matter. But in general, changing write masks is a better idea than playing with the FBO.
If I know beforehand that all fragments will pass depth test, will disabling depth test improve performance?
Are you changing other state at the same time, or is that the only state you're interested in?
One good way to look at these kinds of a priori performance questions is to look at Vulkan or D3D12 and see what it would require in that API. Changing any pipeline state there is a big deal. But changing two pieces of state is no bigger of a deal than one.
So if changing the depth test correlates with changing other state (blend modes, shaders, etc), it's probably not going to hurt any more.
At the same time, if you really care enough about performance for this sort of thing to matter, you should do application testing. And that should happen after you implement this, and across all hardware of interest. And your code should be flexible enough to easily switch from one to the other as needed.

For simple rendering: Is OpenCL faster than OpenGL?

I need to draw hundreds of semi-transparent circles as part of my OpenCL pipeline.
Currently, I'm using OpenGL (with alpha blend), synced (for portability) using clFinish and glFinish with my OpenCL queue.
Would it be faster to do this rendering task in OpenCL? (Assume the rest of the pipeline is already in OpenCL, and may run on the CPU if no OpenCL-compatible GPU is available.)
It's easy to replace the rasterizer with a simple test function in the case of a circle. The blend function requires a single read from the destination texture per fragment. So a naive OpenCL implementation seems theoretically faster. But maybe OpenGL can render non-overlapping triangles in parallel (which would be harder to implement in OpenCL)?
Odds are good that OpenCL-based processing would be faster, but only because you don't have to deal with CL/GL interop. The fact that you have to execute a glFinish/clFinish at all is a bottleneck.
This has nothing to do with fixed-function vs. shader hardware. It's all about getting rid of the synchronization.
Now, that doesn't mean that there aren't wrong ways to use OpenCL to render these things.
What you don't want to do is write colors to memory in one compute operation, then read them back in another compute op, blend, and write them out to memory again. That way lies madness.
What you ought to do instead is effectively build a tile-based renderer internally. Each workgroup will represent some count of pixels (experiment to determine the best count for performance). Each invocation operates on a single pixel. They'll use their pixel position, do the math to determine whether the pixel is within the circle (and how much of it is within the circle), then blend that with a local variable the invocation keeps internally. So each invocation processes all of the circles, only writing their pixel's worth of data out at the very end.
Now if you want to be clever, you can do culling, so that each work group is given only the circles that are guaranteed to affect at least some pixel within their particular area. That is effectively a preprocessing pass, and you could even do that on the CPU, since it's probably not that expensive.
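The per-pixel accumulation described above, translated into a GLSL compute shader for consistency with the rest of this thread (the Circle layout, bindings, and premultiplied-alpha convention are assumptions of this sketch):

```glsl
#version 430
layout(local_size_x = 8, local_size_y = 8) in;

// Illustrative layout; colors assumed premultiplied by alpha.
struct Circle { vec2 center; float radius; float pad; vec4 color; };
layout(std430, binding = 0) readonly buffer Circles { Circle circles[]; };
layout(binding = 0, rgba8) uniform writeonly image2D dst;
uniform int circleCount;

void main() {
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    vec2 p = vec2(pixel) + 0.5;

    // The running blend lives in a register; memory is written exactly once.
    vec4 result = vec4(0.0);
    for (int i = 0; i < circleCount; ++i) {
        float d = distance(p, circles[i].center);
        float coverage = clamp(circles[i].radius - d + 0.5, 0.0, 1.0); // crude edge AA
        vec4 src = circles[i].color * coverage;
        result = src + result * (1.0 - src.a); // "over" operator; later circles land on top
    }
    imageStore(dst, pixel, result);
}
```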

texture(...) vs. textureOffset(...) performance in GLSL

Does utilizing textureOffset(...) increase performance compared to calculating offsets manually and using regular texture(...) function?
As there is a GL_MAX_PROGRAM_TEXEL_OFFSET property, I would guess that it can fetch the offset texels in a single fetch, or at least as few as possible, which would make it superb for e.g. blur effects. But I can't seem to find out anywhere how it works internally.
Update:
Reformulated question: is it common for GL drivers to make any optimizations regarding texture fetches when the textureOffset(...) function is used?
You're asking the wrong question. The question should not be whether the more specific function will always have better performance. The question is whether the more specific function will ever be slower.
And there's no reason to expect it to be slower. If the hardware has no specialized functionality for offset texture accesses, then the compiler will just offset the texture coordinate manually, exactly like you could. If there is hardware to help, then it will use it.
So if you have need of textureOffsets and can live within its limitations, there's no reason not to use it.
I would guess that it can fetch the offset texels in a single fetch, or at least as few as possible, which would make it superb for e.g. blur effects
No, that's textureGather. textureOffset does exactly what its name says: it accesses a texture at a texture coordinate, with a texel offset from that coordinate's location.
textureGather samples from multiple neighboring texels at once. If you need to read a section of a texture to do blurring, textureGather (and textureGatherOffset) are going to be more useful than textureOffset.
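To make the distinction concrete, a fragment-shader sketch (GLSL 4.00+; the sampler and variable names are invented):

```glsl
#version 400
uniform sampler2D tex;
in vec2 uv;
out vec4 fragColor;

void main() {
    // textureOffset: one ordinary (filtered) sample, shifted by whole texels.
    vec4 right = textureOffset(tex, uv, ivec2(1, 0));

    // textureGather: the red channel (component 0) of the four texels that
    // bilinear filtering at uv would touch, fetched in a single call.
    vec4 reds = textureGather(tex, uv, 0);

    // textureGatherOffset: the same four-texel fetch, displaced by an offset.
    vec4 redsRight = textureGatherOffset(tex, uv, ivec2(2, 0), 0);

    // e.g. a cheap 2x2 box average of the red channel:
    fragColor = vec4(vec3((reds.x + reds.y + reds.z + reds.w) * 0.25), 1.0);
    // 'right' and 'redsRight' are unused here; shown only for the signatures.
}
```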

OpenGL - Power Of Two Textures

OpenGL uses power-of-two textures.
This is because some GPUs only accept power-of-two textures due to MipMapping. Using these power-of-two textures causes problems when drawing a texture larger than it is.
I had thought of one way to workaround this, which is to only use the PO2 ratios when we're making the texture smaller than it actually is, and using a 1:1 ratio when we're making it bigger, but will this create compatibility issues with some GPUs?
If anybody knows whether issues would occur (I cannot check this as my GPU accepts NPO2 Textures), or a better workaround, I would be grateful.
Your information is outdated. Textures of arbitrary dimensions have been supported since OpenGL 2.0, which was released in 2004. All contemporary GPUs support NPOT textures very well, and without any significant performance penalty.
There's no need for any workarounds.

How does OpenGL convert single component textures?

I am confused as to how OpenGL stores single component textures(like GL_RED).
The GL converts it to floating point and assembles it into an RGBA element by attaching 0 for green and blue, and 1 for alpha.
Does this mean that my texture will take 32 bpp in graphic memory even though I only give 8 bpp?
Also, I would like to know how OpenGL converts bytes to floats for operations in the shader. It doesn't seem logical to simply divide by 255…
You don't know, and you have no way of knowing. (OK, that's a slight lie... there exists documentation that tells you those details for some particular hardware. But in general you have no way of knowing, because you don't know in advance what hardware your program will run on.)
OpenGL stores textures roughly following your request, but it ultimately chooses something the hardware supports. If that means converting your input data to something completely different, it does so silently.
For example, most implementations convert RGB to RGBA because that is more convenient for memory accesses. The same goes for 5-5-5 data being converted to 8-8-8, and similar cases.
Usually an 8 bpp texture will take only one byte per pixel nowadays (since pretty much every card supports that, and for software implementations it doesn't matter), though this is not something you can rely on 100%. You shouldn't worry either way... the implementation will make sure it somehow works.
Something similar can happen with non-power-of-two textures, by the way. This is supported on all modern versions of OpenGL (beginning with 2.0, if I remember right), though at least in theory some older cards might not support the feature.
In that case, OpenGL would silently round the texture up to the next power-of-two size and use only part of it (without telling you!).
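As for the shader-visible part of the question above: for a single-channel internal format like GL_R8, the missing channels are filled in at sampling time, and the byte-to-float conversion really is a divide by 255 (the unsigned-normalized rule, c = i / (2^b - 1)). A fragment-shader sketch, with invented names:

```glsl
#version 330
uniform sampler2D redTex; // assumed bound to a texture with internalformat GL_R8
in vec2 uv;
out vec4 fragColor;

void main() {
    vec4 t = texture(redTex, uv);
    // t == vec4(r, 0.0, 0.0, 1.0): green/blue filled with 0, alpha with 1.
    // The stored byte i is seen here as r = i / 255.0, so 0 maps to 0.0
    // and 255 maps to exactly 1.0.
    fragColor = vec4(t.rrr, 1.0);
}
```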