Write to multiple 3D textures in a fragment shader (OpenGL / C++)

I have a 3D texture that I write data into and use as voxels in the fragment shader, like this:
#extension GL_ARB_shader_image_size : enable
...
layout (binding = 0, rgba8) coherent uniform image3D volumeTexture;
...
void main(){
vec4 fragmentColor = ...
vec3 coords = ...
imageStore(volumeTexture, ivec3(coords), fragmentColor);
}
The texture is defined like this:
glGenTextures(1, &volumeTexture);
glBindTexture(GL_TEXTURE_3D, volumeTexture);
glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA8, volumeDimensions, volumeDimensions, volumeDimensions, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
and it is bound like this when I need to use it:
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_3D, volumeTexture);
Now my issue is that I would like a mipmapped version of this texture, but without using the built-in OpenGL mipmap generation, because I noticed it is extremely slow. So I was thinking of writing into the 3D texture at all levels at the same time: for instance, if the maximum resolution is 512^3, then as I write one voxel VALUE into that 3D texture I also write 0.125*VALUE into the 256^3 level, 0.015625*VALUE into the 128^3 level, and so on. Since I am using imageStore, which uses atomicity, all values will be written, and using these weights I would automatically get the average value (not exactly like an interpolation, but I might get a pleasing result anyway).
So my question is: what is the best way to have multiple 3D textures and write to all of them at the same time?
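For reference, the mechanical part of this (binding several images and writing to all of them in one pass) looks roughly like the sketch below. This is not an endorsement of the weighting scheme; the glTexStorage3D allocation stands in for the glTexImage3D call above, and names such as volumeLevel1 are illustrative.
// Allocate the texture with several mip levels so that the non-zero levels actually exist.
const int numLevels = 3;  // e.g. 512^3, 256^3, 128^3
glGenTextures(1, &volumeTexture);
glBindTexture(GL_TEXTURE_3D, volumeTexture);
glTexStorage3D(GL_TEXTURE_3D, numLevels, GL_RGBA8, volumeDimensions, volumeDimensions, volumeDimensions);
// Bind each level to its own image unit before drawing.
for (int level = 0; level < numLevels; ++level)
    glBindImageTexture(level, volumeTexture, level, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA8);
// Fragment shader side: one declaration per binding, one store per level, e.g.
//   layout (binding = 1, rgba8) coherent uniform image3D volumeLevel1;
//   imageStore(volumeLevel1, ivec3(coords) / 2, 0.125 * fragmentColor);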

I believe hardware mipmapping is about as fast as you'll get. I've always assumed attempting custom mipmapping would be slower given you have to bind and rasterize to each layer manually in turn. Atomics will give huge contention and it'll be amazingly slow. Even without atomics you'd be negating the nice O(log n) construction of mipmaps.
You have to be really careful with imageStore with regard to access order and cache. I'd start here and try some different indexing (e.g. row/column vs. column/row).
You could try drawing to the texture the older way, by binding it to an FBO and drawing a full screen triangle (big triangle that covers the viewport) with glDrawElementsInstanced. In the geometry shader, set gl_Layer to the instance ID. The rasterizer creates fragments for x/y and the layer gives z.
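A sketch of the host side of that approach, with placeholder names (layeredProgram, fullscreenTriangleVao) and glDrawArraysInstanced used here so no index buffer is needed; the geometry shader is assumed to copy the per-instance ID into gl_Layer:
GLuint fbo = 0;
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
// Attaching without a layer index makes the attachment layered,
// so gl_Layer in the geometry shader selects the z-slice.
glFramebufferTexture(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, volumeTexture, 0);
glViewport(0, 0, volumeDimensions, volumeDimensions);
glUseProgram(layeredProgram);
glBindVertexArray(fullscreenTriangleVao);                      // 3 vertices covering the viewport
glDrawArraysInstanced(GL_TRIANGLES, 0, 3, volumeDimensions);   // one instance per slice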
Lastly, 512^3 is simply a huge texture, even by today's standards. Maybe find out your theoretical max GPU bandwidth to get an idea of how far away you are. E.g. let's say your GPU can do 200 GB/s; you'll probably only get 100 in a good case anyway. Your 512^3 texture at 4 bytes per voxel is 512 MB, so you might be able to write to it in ~5 ms (IMO this seems awfully fast, maybe I made a mistake). Expect some overhead and latency from the rest of the pipeline, spawning and executing threads, etc. If you're writing complex stuff then memory bandwidth isn't the bottleneck and my estimate goes out the window. So try just writing zeroes first. Then try changing the xyz order of the coords.
Update: instead of using the fragment shader to create your threads, the vertex shader can be used, which in theory avoids rasterizer overhead, though I've seen cases where it doesn't perform as well. You glEnable(GL_RASTERIZER_DISCARD), call glDrawArrays(GL_POINTS, 0, numThreads) and use gl_VertexID as your thread index.
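A minimal sketch of that variant, assuming writeProgram (placeholder) does its imageStore work in the vertex shader and emptyVao is a VAO with no attributes:
glEnable(GL_RASTERIZER_DISCARD);         // no fragments are generated at all
glUseProgram(writeProgram);
glBindVertexArray(emptyVao);
glDrawArrays(GL_POINTS, 0, numThreads);  // gl_VertexID serves as the thread index
glDisable(GL_RASTERIZER_DISCARD);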

Related

Slow compute shader, global vs local work groups?

I have created this simple compute shader to go through a 3D texture and set alpha values greater than 0 to 1:
#version 440 core
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0, rgba8) uniform image3D voxelTexture;
void main() {
ivec3 pos = ivec3(gl_GlobalInvocationID);
vec4 value = imageLoad(voxelTexture, pos);
if(value.a > 0.0) {
value.a = 1.0;
imageStore(voxelTexture, pos, value);
}
}
I invoke it using the texture dimensions as work group count, size = 128:
opacityFixShader.bind();
glBindImageTexture(0, result.mID, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA8);
glDispatchCompute(size, size, size);
opacityFixShader.unbind();
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
Timing this in RenderDoc using a GTX 1080 Ti results in a whopping 3.722 ms, which seems way too long. I feel like I am not taking full advantage of compute, should I increase the local group size or something?
I feel like I am not taking full advantage of compute, should I increase the local group size or something?
Yes, definitely. An implementation-defined number of invocations inside each work group will be bundled together as a Warp/Wavefront/Subgroup/whatever-you-like-to-call-it and executed on the actual SIMD hardware units. For all practical purposes, you should use a multiple of 64 for the local size of the work group, otherwise you will waste a lot of potential GPU power.
Your workload will be totally dominated by the memory accesses, so you should also think about optimizing your accesses for cache efficiency. Since you use a 3D texture, I would actually recommend a 3D local size like 4x4x4 or 8x8x8, so that you profit from the 3D data organization your GPU most likely uses internally for storing 3D texture data.
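A minimal sketch of that change, assuming the 128^3 texture from the question; only the local size declaration and the dispatch counts change:
// Shader side, replacing the 1x1x1 declaration (8x8x8 = 512 invocations per group):
//   layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;
// Host side, the work group count shrinks by the same factor (128 / 8 = 16):
const GLuint groups = size / 8;
opacityFixShader.bind();
glBindImageTexture(0, result.mID, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA8);
glDispatchCompute(groups, groups, groups);
opacityFixShader.unbind();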
Side note:
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
Are you sure about that? If you are going to sample from the texture afterwards, that is the wrong barrier (the barrier has to match how the data is consumed next, so a subsequent texture fetch would need GL_TEXTURE_FETCH_BARRIER_BIT).
Also:
I have created this simple compute shader to go through a 3D texture and set alpha values greater than 0 to 1
Why are you doing this? This might be a typical X-Y problem. Spending a whole compute pass on just that might be a bad idea in the first place, and it will never make good use of the compute resources of the GPU. This operation could also potentially be done in the shaders where you actually use the texture, and it might be practically free of cost there, because those shaders are also very likely to be dominated by the latency of the texture accesses. Another point to consider is that you might access the texture with some texture filtering and still get alpha values between 0 and 1 even after your pre-process (but maybe that is exactly what you want).

(Modern) OpenGL Different Colored Faces on a Cube - Using Shaders

A cube with different colored faces in immediate mode is very simple. But doing the same thing with shaders seems to be quite a challenge.
I have read that in order to create a cube with different coloured faces, I should create 24 vertices instead of 8 vertices for the cube; in other words, I visualise this as 6 squares that don't quite touch.
Is another (perhaps better?) solution to texture the faces of the cube using a really simple texture, a flat color, perhaps a 1x1 pixel texture?
My texturing idea seems simpler to me from a coder's point of view, but which method would be the most efficient from a GPU/graphics card perspective?
I'm not sure what your overall goal is (e.g. what you're learning to do in the long term), but generally for high-performance applications (e.g. games) your goal is to reduce GPU load. Every time you switch certain states (e.g. change textures, render targets, shader uniform values, etc.) the GPU stalls while reconfiguring itself to meet your demands.
So, you can pass in a 1x1 pixel texture for each face, but then you'd need six draw calls (usually not so bad, but there is some prep work and potential cache misses) and six texture binds (which can be very bad, often as bad as changing shader uniform values).
Suppose you wanted to pass in one texture and use that as a texture map for the cube. This is a little less trivial than it sounds: you need to express each face on the texture in a way that maps to the vertices. Often you need to pass in a texture coordinate for each vertex, and due to the spatial configuration of the texture this normally doesn't end up meaning one texture coordinate per spatial vertex.
However, if you use an environment/reflection map, the complexities of the mapping are handled for you. In this way, you could draw a single texture on all sides of your cube. (Or on your sphere, or whatever sphere-mapped shape you want.) I'm not sure I'd call this easier, since you have to form the environment texture carefully, and you still have to set a different texture for each new color you want to represent, or change the texture either on the GPU or in step with the GPU, and that's tricky and usually not performant.
Which brings us back to the canonical way you mentioned: use vertex values. They're fast, you can draw many, many cubes very quickly by only specifying different vertex data, and it's easy to understand. It really is the best way, and it's how GPUs are designed to run quickly.
Additionally...
And yes, you can do this with just shaders, but it'd be ugly and slow, and the GPU would end up computing it per pixel: pass the object-space coordinates to the fragment shader, and in the fragment shader test which side you're on and output the corresponding color. Highly not recommended; it's not particularly easier, and it's definitely not faster for the GPU, since to change colors you'd again end up changing uniform values for the shaders.
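Purely for illustration of that last (not recommended) approach, a fragment shader along these lines could pick the color from the dominant axis of the object-space position; objPos and faceColors are assumed names, and the cube is assumed to be centered at the origin:
// Hypothetical fragment shader source (GLSL embedded as a C++ string literal):
const char* cubeFaceFragmentSrc = R"(
#version 330 core
in vec3 objPos;              // object-space position forwarded by the vertex shader
out vec4 fragColor;
uniform vec3 faceColors[6];  // +X, -X, +Y, -Y, +Z, -Z
void main() {
    vec3 a = abs(objPos);
    int face;
    if (a.x >= a.y && a.x >= a.z) face = objPos.x > 0.0 ? 0 : 1;
    else if (a.y >= a.z)          face = objPos.y > 0.0 ? 2 : 3;
    else                          face = objPos.z > 0.0 ? 4 : 5;
    fragColor = vec4(faceColors[face], 1.0);
}
)";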

Why is a simple shader slower than the standard pipeline?

I want to write a very simple shader which is equivalent to (or faster than) the standard pipeline. However, even the simplest shader possible:
Vertex Shader
void main(void)
{
gl_TexCoord[0] = gl_MultiTexCoord0;
gl_Position = ftransform();
}
Fragment Shader
uniform sampler2D Texture0;
void main(void)
{
gl_FragColor = texture2D(Texture0, gl_TexCoord[0].xy);
}
Cuts my framerate in half in my game, compared to the standard shader, and performs horribly if some transparent images are displayed. I don't understand this, because the standard shader (glUseProgram(0)) does lighting and alpha blending, while this shader only draws flat textures. What makes it so slow?
It looks like this massive slowdown of custom shaders is a problem with old Intel Graphics chips, which seem to emulate the shaders on the CPU.
I tested the same program on recent hardware and the frame drop with the custom shader activated is only about 2-3 percent.
EDIT: wrong theory. See new answer below
I think you might be bumping into overdraw.
I don't know what engine you are using your shader in, but if you have alpha blending on then you might end up overdrawing a lot.
Think about it this way:
If you have an 800x600 screen and a 2D quad over the whole screen, that 2D quad will generate 480,000 fragment shader calls, although it has only 4 vertices.
Now, moving further, let's assume you have 10 such quads, one on top of another. If you don't sort your geometry front to back, or if you are using alpha blending with no depth test, then you will end up with 10 x 800 x 600 = 4,800,000 fragment calls.
2D is usually quite expensive in OpenGL due to the overdraw. 3D rejects many fragments; even though the shaders are more complicated, the number of calls is greatly reduced for 3D objects compared to 2D objects.
After a long investigation, the slowdown of the simple shader turned out to be caused by the shader being too simple.
In my case, the slowdown came from the text rendering engine, which made heavy use of glBitmap, which is very slow with textures enabled (for whatever reason I cannot understand; these letters are tiny).
However, this did not affect the standard pipeline, because it honors glDisable(GL_LIGHTING) and glDisable(GL_TEXTURE_2D), which circumvents the slowdown, whereas the simple shader ignored those settings and thus did even more work than the standard pipeline. After introducing these two toggles to the custom shader, it is as fast as the standard pipeline, plus it has the ability to add random effects without any performance impact!
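A hedged sketch of what such a toggle might look like for the fragment shader above; useTexture is an invented uniform name, set from the same places that previously called glDisable(GL_TEXTURE_2D):
// GLSL embedded as a C++ string literal, using the same compatibility built-ins as the question.
const char* fragmentSrc = R"(
uniform sampler2D Texture0;
uniform bool useTexture;   // false where the fixed pipeline had GL_TEXTURE_2D disabled
void main(void)
{
    if (useTexture)
        gl_FragColor = texture2D(Texture0, gl_TexCoord[0].xy);
    else
        gl_FragColor = vec4(1.0);  // plain untextured output
}
)";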

GLSL: How to access nearby vertex colors? (bilinear interpolation without uniforms)

I'm trying to do bilinear color interpolation on a quad. I succeeded with the help of my previous question on here, but it has bad performance because it requires me to repeat glBegin()/glEnd() and to call glUniform() four times before each glBegin().
The question is: is it in any way possible to apply bilinear color interpolation on a quad like this:
glBegin(GL_QUADS);
glColor4f(...); glVertexAttrib2f(uv, 0, 0); glTexCoord2f(...); glVertex3f(...);
glColor4f(...); glVertexAttrib2f(uv, 1, 0); glTexCoord2f(...); glVertex3f(...);
glColor4f(...); glVertexAttrib2f(uv, 1, 1); glTexCoord2f(...); glVertex3f(...);
glColor4f(...); glVertexAttrib2f(uv, 0, 1); glTexCoord2f(...); glVertex3f(...);
... // here can be any amount of quads without repeating glBegin()/glEnd()
glEnd();
To do this, I think I should somehow access the nearby vertex colors, but how? Or are there any other solutions for this?
I need it to work this way so I can easily switch between different interpolation shaders.
Any other solution that works with one glBegin() command is good too, but sending all the corner colors per vertex isn't acceptable, unless that's the only solution here?
Edit: the example code uses immediate mode for clarity only. Even with vertex arrays/buffers the problem would be the same: I would have to split the rendering calls into 4-vertex chunks, which is what causes the whole speed drop here!
Long story short: You cannot do this with a vertex shader.
The interpolator (or rasterizer) is one of the components of the graphics pipeline that is not programmable. Given how the graphics pipe works, neither a vertex shader nor a fragment shader are allowed access to anything but their vertex (or fragment, respectively), for reasons of speed, simplicity, and parallelism.
The workaround is to use a texture lookup, which has already been noted in previous answers.
In newer versions of OpenGL (3.2 and up) there is now the concept of a geometry shader. Geometry shaders are more complicated to implement than the relatively simple vertex and fragment shaders, but geometry shaders are given topological information. That is, they execute on a primitive (point, line, triangle) rather than a single point. With that information, they could create additional geometry in order to resolve your alternate color interpolation method.
However, that's far more complicated than necessary. I'd stick with a 4-texel texture map and implement your logic in the fragment shader's lookup.
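A hedged sketch of that 4-texel approach: the corner colors go into a 2x2 texture and the hardware's bilinear filter does the interpolation. Here cornerColors is an assumed array of four RGBA bytes per corner, and mapping the quad's UVs into [0.25, 0.75] puts the texel centers exactly at the quad corners:
GLuint cornerTex = 0;
glGenTextures(1, &cornerTex);
glBindTexture(GL_TEXTURE_2D, cornerTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 2, 2, 0, GL_RGBA, GL_UNSIGNED_BYTE, cornerColors);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
// Fragment shader side (uv is the per-vertex attribute already sent in the question):
//   gl_FragColor = texture2D(cornerTexture, uv * 0.5 + vec2(0.25));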
Under the hood, OpenGL (and all the hardware that it drives) will do everything as triangles, so if you choose to blend colors via vertex interpolation, it will be triangular interpolation because the hardware doesn't work any other way.
If you want "quad" interpolation, you should put your colors into a texture, because in hardware a texture is always "quad" shaped.
If you really think it's the number of draw calls that causes your performance drop, you can try instancing (using glDrawArraysInstanced + glVertexAttribDivisor), available in GL 3.1 core.
An alternative might be point sprites, depending on your usage model (mostly the maximum size of your quads, and whether they are always perpendicular to the view). Those have been available since GL 2.0 core.
Linear interpolation with colours specified per vertex can be set up efficiently using glColorPointer. Similarly you should use glTexCoordPointer/glVertexAttribPointer/glVertexPointer to replace all those individual per-vertex calls with a single call referencing the data in an array. Then render all your quads with a single (or at most a handful of) glDrawArrays or glDrawElements call. You'll see a huge improvement from this even without VBOs (which just change where the arrays are stored).
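For reference, a sketch of the client-side-array version of the immediate-mode code in the question; positions and colors are assumed to be tightly packed per-vertex arrays with 4 vertices per quad:
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_COLOR_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, positions);  // x, y, z per vertex
glColorPointer(4, GL_FLOAT, 0, colors);      // r, g, b, a per vertex
glDrawArrays(GL_QUADS, 0, 4 * quadCount);    // every quad in one call
glDisableClientState(GL_COLOR_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);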
You mention you want to change shaders (between ShaderA and ShaderB say) on a quad by quad basis. You should either:
Arrange things so you can batch all of the ShaderA quads together and all the ShaderB quads together and render all of each together with a single call. Changing shader is generally quite expensive so you want to minimise the number of changes.
or
Implement all the different shader logic you want in a single "unified" shader, but selected by another vertex attribute which selects between the different codepaths. Whether this is anywhere near as efficient as the batching approach (which is preferable) depends on whether or not each "tile" of SIMD shaders tends to have to run a mixture of paths or just one.

Quick question about glColorMask and how it works

I want to render the depth buffer to do some nice shadow mapping. My drawing code, though, consists of many shader switches. If I set glColorMask(0,0,0,0) and leave all shader programs, textures and the rest as they are, and just render the depth buffer, will it be 'OK'? I mean, if glColorMask disables the writing of color components, does that mean per-fragment shading is NOT going to be performed?
For rendering a shadow map, you will normally want to bind a depth texture (preferably square and power-of-two, because stereo drivers take this as a hint!) to an FBO and use exactly one shader (as simple as possible) for everything. You do not want to attach a color buffer, because you are not interested in color at all, and it puts unnecessary extra pressure on the ROPs (plus, some hardware can render at double speed or more when doing depth-only rendering). You do not want to switch between many shaders.
Depending on whether you do "classic" shadow mapping or something more sophisticated such as exponential shadow maps, the shader you use is either as simple as it can be (a constant color, and no depth writes in the shader) or performs some moderately complex calculations on depth, but you normally do not want to perform any color calculations, since those would be needless work whose results are never visible.
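A hedged sketch of the depth-only FBO setup described above, with shadowSize standing in for the square power-of-two resolution:
GLuint shadowTex = 0, shadowFbo = 0;
glGenTextures(1, &shadowTex);
glBindTexture(GL_TEXTURE_2D, shadowTex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT24, shadowSize, shadowSize, 0, GL_DEPTH_COMPONENT, GL_FLOAT, 0);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glGenFramebuffers(1, &shadowFbo);
glBindFramebuffer(GL_FRAMEBUFFER, shadowFbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, shadowTex, 0);
glDrawBuffer(GL_NONE);  // no color attachment and no color writes
glReadBuffer(GL_NONE);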
No, the fragment operations will be performed anyway, but their result will be squashed by your zero color mask.
If you don't want some fragment operations to be performed, use a shader program that has an empty fragment shader attached, and set the draw buffer to GL_NONE.
There is another way to disable fragment processing: enable GL_RASTERIZER_DISCARD. But you won't get even the depth values in that case :)
No, the shader programs execute independently of the fixed-function pipeline. Setting glColorMask will have no effect on the shader programs.