Slow compute shader, global vs local work groups?

I have created this simple compute shader to go through a 3D texture and set alpha values greater than 0 to 1:
#version 440 core
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(binding = 0, rgba8) uniform image3D voxelTexture;

void main() {
    ivec3 pos = ivec3(gl_GlobalInvocationID);
    vec4 value = imageLoad(voxelTexture, pos);
    if (value.a > 0.0) {
        value.a = 1.0;
        imageStore(voxelTexture, pos, value);
    }
}
I invoke it using the texture dimensions as the work group counts, with size = 128:
opacityFixShader.bind();
glBindImageTexture(0, result.mID, 0, GL_TRUE, 0, GL_READ_WRITE, GL_RGBA8);
glDispatchCompute(size, size, size);
opacityFixShader.unbind();
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
Timing this in RenderDoc using a GTX 1080 Ti results in a whopping 3.722 ms, which seems way too long. I feel like I am not taking full advantage of compute, should I increase the local group size or something?

I feel like I am not taking full advantage of compute, should I increase the local group size or something?
Yes, definitely. An implementation-defined number of invocations inside each work group is bundled together as a warp/wavefront/subgroup (pick your preferred term) and executed on the actual SIMD hardware units. For all practical purposes, you should use a multiple of 64 for the total local size of the work group; otherwise you will waste a lot of potential GPU power.
Your workload will be totally dominated by the memory accesses, so you should also think about optimizing your accesses for cache efficiency. Since you use a 3D texture, I would actually recommend a 3D local size like 4x4x4 or 8x8x8, so that you profit from the 3D data organization your GPU most likely uses internally for storing 3D texture data.
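As a minimal sketch of that change (assuming your texture dimension of 128 stays a multiple of the local size):
// In the compute shader: 8x8x8 = 512 invocations per work group
layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;
// On the host side, divide the work group count accordingly (128 / 8 = 16 groups per axis)
glDispatchCompute(size / 8, size / 8, size / 8);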
Side note:
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
Are you sure about that? If you are going to sample from the texture afterwards, that will be the wrong barrier.
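For example, if the texture is consumed via a sampler (texture()/texelFetch()) in a later pass, the matching barrier would be:
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);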
Also:
I have created this simple compute shader to go through a 3D texture and set alpha values greater than 0 to 1
Why are you doing this? This might be a typical X-Y problem. Spending a whole compute pass on just that might be a bad idea in the first place, and it will never make good use of the compute resources of the GPU. This operation could also be done in the shaders where you actually use the texture, and it might be practically free of cost there, because that shader is also very likely to be dominated by the latency of the texture accesses. Another point to consider is that you might access the texture with some texture filtering, and still get alpha values between 0 and 1 even after your pre-process (but maybe you want exactly that).
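For illustration, in the shader that actually samples the texture the whole pre-process could collapse into a single line (texel standing for whatever value you sampled):
texel.a = texel.a > 0.0 ? 1.0 : 0.0;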

Related

Is there a workaround for increasing GL_MAX_ARRAY_TEXTURE_LAYERS?

I'm using a texture array to render Minecraft-style voxel terrain. It's working fantastically, but I noticed recently that GL_MAX_ARRAY_TEXTURE_LAYERS is a lot smaller than GL_MAX_TEXTURE_SIZE.
My textures are very small, 8x8, but I need to be able to support rendering from an array of hundreds to thousands of them; I just need GL_MAX_ARRAY_TEXTURE_LAYERS to be larger.
OpenGL 4.5 requires GL_MAX_ARRAY_TEXTURE_LAYERS be at least 2048, which might suffice, but my application is targeting OpenGL 3.3, which only guarantees 256+.
I'm drawing a blank trying to figure out a prudent workaround for this limitation; dividing up the terrain rendering based on the max number of supported texture layers does not sound at all trivial to me.
I looked into whether ARB_sparse_texture could help, but GL_MAX_SPARSE_ARRAY_TEXTURE_LAYERS_ARB is the same as GL_MAX_ARRAY_TEXTURE_LAYERS; that extension is just a workaround for VRAM usage rather than layer usage.
Can I just have my GLSL shader access an array of sampler2DArray? GL_MAX_COMBINED_TEXTURE_IMAGE_UNITS has to be at least 80+, so 80+ * 256+ = 20480+, and that would be enough layers for my purposes. So, in theory, could I do something like this?
const int MAXLAYERS = 256;
in vec3 texCoord;
uniform sampler2DArray tex[];
out vec4 FragColor;
void main()
{
    int arrayIdx = int(texCoord.z + 0.5f) / MAXLAYERS;
    float arrayOffset = mod(texCoord.z, float(MAXLAYERS));
    FragColor = texture(tex[arrayIdx],
                        vec3(texCoord.x, texCoord.y, arrayOffset));
}
It would be better to ditch array textures and just use a texture atlas (or use an array texture with each layer containing lots of sub-textures, but as I will show, that's highly unnecessary). If you're using textures of such low resolution, you probably aren't using linear interpolation, so you can easily avoid bleed-over from neighboring texels. And even if you have trouble with bleed-over, it can easily be fixed by adding some space between the sub-textures.
Even if your sub-textures need to be 10x10 to avoid bleed-over, a 1024x1024 texture (the minimum size GL 3.3 requires) gives you 102x102 sub-textures, which is 10,404 textures. That ought to be plenty. And if it's not, then make it an array texture with however many layers you need.
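A sketch of the atlas lookup this implies, assuming 10x10 cells with the 8x8 sub-texture centered in each (all names and sizes here are illustrative):
// Map a sub-texture index plus its local UV into atlas texture coordinates.
const float ATLAS_SIZE = 1024.0; // atlas size in texels
const float CELL = 10.0;         // cell size: 8x8 sub-texture plus a 1-texel border
const int   GRID = 102;          // cells per row/column
vec2 atlasUV(int subTex, vec2 localUV)
{
    vec2 cell = vec2(subTex % GRID, subTex / GRID);
    vec2 texel = cell * CELL + 1.0 + localUV * 8.0; // skip the 1-texel border
    return texel / ATLAS_SIZE;
}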
Arrays of samplers will not work for your purpose. First, you cannot declare an unsized uniform array of any kind. Well, you can, but you have to redeclare it with a size at some point in your shader, so there's not much point to the unsized declaration. The only truly unsized arrays you can have are in SSBOs, as the last member of the SSBO.
Second, even with a size, the index used for arrays of opaque types must be dynamically uniform. And since you're trying to draw all of the faces of the cubes in one draw call, with each face able to select a different layer, this expression's value is not going to be dynamically uniform.
Third, even if you did this with bindless texturing, you would run into the same problem: unless you're on NVIDIA hardware, the sampler you pick must be dynamically uniform, which requires the index into the array of samplers to be dynamically uniform. Which yours is not.

Write to multiple 3D textures in fragment shader OpenGL

I have a 3D texture where I write data and use it as voxels in the fragment shader in this way:
#extension GL_ARB_shader_image_size : enable
...
layout (binding = 0, rgba8) coherent uniform image3D volumeTexture;
...
void main(){
    vec4 fragmentColor = ...
    vec3 coords = ...
    imageStore(volumeTexture, ivec3(coords), fragmentColor);
}
and the texture is defined in this way
glGenTextures(1, &volumeTexture);
glBindTexture(GL_TEXTURE_3D, volumeTexture);
glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA8, volumeDimensions, volumeDimensions, volumeDimensions, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
and then this when I have to use it
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_3D, volumeTexture);
Now my issue is that I would like to have a mipmapped version of this, without using the OpenGL function, because I noticed that it is extremely slow. So I was thinking of writing to the 3D texture at all mip levels at the same time: for instance, the max resolution is 512^3, and as I write 1 voxel VALUE into that 3D texture I also write 0.125*VALUE into the 256^3 level, 0.015625*VALUE into the 128^3 level, etc. Since I am using imageStore, which uses atomicity, all values will be written, and with these weights I would automatically get the average value (not exactly an interpolation, but I might get a pleasing result anyway).
So my question is: what is the best way to have multiple 3D textures and write to all of them at the same time?
I believe hardware mipmapping is about as fast as you'll get. I've always assumed attempting custom mipmapping would be slower given you have to bind and rasterize to each layer manually in turn. Atomics will give huge contention and it'll be amazingly slow. Even without atomics you'd be negating the nice O(log n) construction of mipmaps.
You have to be really careful with imageStore with regard to access order and caching. I'd start by trying some different indexing (e.g. row/column vs column/row).
You could try drawing to the texture the older way: bind it to an FBO and draw a full-screen triangle (a big triangle that covers the viewport) with glDrawElementsInstanced. In the geometry shader, set gl_Layer to the instance ID. The rasterizer creates the fragments for x/y and the layer gives z.
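A sketch of the geometry shader for that approach (vInstance being a pass-through of gl_InstanceID from the vertex shader):
layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;
in int vInstance[]; // gl_InstanceID forwarded by the vertex shader
void main()
{
    for (int i = 0; i < 3; i++) {
        gl_Position = gl_in[i].gl_Position;
        gl_Layer = vInstance[0]; // one slice of the 3D target per instance
        EmitVertex();
    }
    EndPrimitive();
}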
Lastly, 512^3 is simply a huge texture even by today's standards. Maybe find out your theoretical max GPU bandwidth to get an idea of how far away you are. E.g. let's say your GPU can do 200GB/s; you'll probably only get 100 in a good case anyway. Your 512^3 RGBA8 texture is 512MB, so you might be able to write to it in ~5ms (IMO this seems awfully fast, maybe I made a mistake). Expect some overhead and latency from the rest of the pipeline, spawning and executing threads etc. If you're writing complex stuff then memory bandwidth isn't the bottleneck and my estimate goes out the window. So try just writing zeroes first. Then try changing the xyz order of the coords.
Update: instead of using the fragment shader to create your threads, the vertex shader can be used, which in theory avoids rasterizer overhead, though I've seen cases where it doesn't perform as well. You glEnable(GL_RASTERIZER_DISCARD), glDrawArrays(GL_POINTS, 0, numThreads) and use gl_VertexID as your thread index.
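A minimal host-side sketch of that trick (numThreads being however many threads you need):
glEnable(GL_RASTERIZER_DISCARD);        // no fragments are generated at all
glDrawArrays(GL_POINTS, 0, numThreads); // gl_VertexID runs from 0 to numThreads - 1
glDisable(GL_RASTERIZER_DISCARD);
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT); // pick the bit matching the next access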

Most efficient way to perform sum of textures

What is the best way, from a performance point of view, to perform a (weighted) sum of the contents of two textures? I'm fine with performing this either on the CPU or on the GPU, as long as it is fast. Note that this must be repeated multiple times, so it's not just a one-shot sum of two textures.
In particular I'm interested in a weighted sum of several textures, but I believe this can easily be generalized from the sum of two.
EDIT:
To make my goal clearer: I have to generate several textures (sequentially) with various blurs, so these textures will all be generated by rendering to texture. I don't think there will ever be more than 8 or 9 of them.
In the end the result must be displayed on screen.
So if I understand the question correctly, you render into some textures, then need a weighted sum over all of those textures, and want to display just that image. If so, you could simply do one extra rendering pass with all of your textures bound, and calculate the weighted sum of all textures in the fragment shader. Since you do not need the result as a texture, you can render directly into the default framebuffer, so the result becomes immediately visible.
Since you need at most 8 or 9 textures, you can actually follow that strategy, as there will be enough texture units. However, that approach might be a bit inflexible, especially if you have to deal with a varying number of textures to sum up at different points in time.
It would be nice if you could just have a uniform variable with the count, an array of weight values, and a loop in the shader, which would boil down to
uniform int count;
uniform float weights[MAX_COUNT];
uniform sampler2D uTex[MAX_COUNT];
[...]
for (int i = 0; i < count; i++)
    sum += weights[i] * texture(uTex[i], texcoords);
And you can do that beginning with GL 4: it supports arrays of texture samplers, but requires that the access index be dynamically uniform, which means that all shader invocations access the same texture samplers at the same time. As the loop only depends on a uniform variable, this is the case.
However, it might be a better strategy to just not use multiple textures. Assuming all of your input textures have the same resolution, you might be better off with a single array texture. You can attach a layer of such an array texture to an FBO just as you can an ordinary 2D texture, so rendering to the layers independently (or to multiple layers at a time using multiple render targets) will just work. You then only need to bind that single array texture and can do
uniform int count;
uniform float weights[MAX_COUNT];
uniform sampler2DArray uTex;
[...]
for (int i = 0; i < count; i++)
    sum += weights[i] * texture(uTex, vec3(texcoords, i));
This only requires GL3-level hardware, and the maximum count you can work with is not limited by the number of texture units available to the fragment shader, but by the array texture layer limit (typically >= 256) and the available memory. However, performance will go down if count gets too high. You might reach a point where multiple passes, each summing only a sub-range of your images, become more efficient due to the texture cache: the accesses to the different layers all compete for the texture cache, negatively impacting the cache hit rate between neighboring fragments. But this should be no issue with just 8 or 9 input images.
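For completeness, attaching a single layer of the array texture to an FBO for the render-to-texture passes would look roughly like this (fbo, arrayTex and layer are illustrative names):
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
// attach layer `layer` of mip level 0 as the first color attachment
glFramebufferTextureLayer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, arrayTex, 0, layer);
// ... render the blurred image for this layer ...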

OpenGL/GLSL faster way than imageStore() to set multiple pixels of a texture?

I have a compute shader that is dispatched iteratively and uses a 2d texture to temporarily store values. Each invocation id accesses a particular row in the texture.
The problem is, this texture must be initialized to 0's before each shader dispatch.
Currently I use a loop at the end of the shader code that uses imageStore() to reset all pixels in the respective row back to 0.
for (uint i = 0; i < CONSTANT_SIZE; i++)
{
    imageStore(myTexture, ivec2(i, global_invocation_id), vec4(0, 0, 0, 0));
}
I was wondering if there is a faster way of doing this: a way to set more than one pixel with a single call (preferably an entire row)? I've looked at the image operations in the GLSL 4.3 specification, but I can't find one that doesn't require a specific pixel location.
If there is a faster way to achieve this on the CPU I would be open to that as well. I've tried re-uploading the texture with glTexImage2D(), but there was no noticeable performance change compared to using imageStore for each individual pixel.
The "faster way" would be to clear the texture from OpenGL, rather than in your shader. 4.4 provides a direct texture clearing function, but even something as simple as a pixel transfer via glTexSubImage2D (after a barrier of course) would probably be faster than what you're doing.
Alternatively, if all you're using this texture for is scratch memory for invocations... why are you using a texture? It'd be better to use shared variables for that. Just create an array of arrays of vec4s, where each local invocation accesses one of the arrays. Access to those is going to be loads faster.
Given 32KB of storage for shared variables (the bare minimum allowed), if you have 8 invocations per work group, that gives each one 4KB to work with. That gives each one 256 vec4s to play with. If you move up to 16 invocations, you reduce this to 128 vec4s.
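A sketch of that layout, assuming 8 invocations per work group, each owning a 256-element row:
layout(local_size_x = 8) in;
// 8 invocations x 256 vec4s x 16 bytes = 32 KB of shared storage
shared vec4 scratch[8][256];
void main()
{
    uint row = gl_LocalInvocationID.x;
    for (uint i = 0u; i < 256u; i++)
        scratch[row][i] = vec4(0.0); // per-invocation scratch, no image traffic
    // ... use scratch[row][...] as this invocation's working memory ...
}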

Which mouse picking strategy for millions of primitives?

I am rendering models consisting of millions (up to ten million) of triangles using VBOs, and I need to detect which of these triangles the user may click on.
I have tried to read up on how the "name stack" and the "unique color" approaches work.
I found that the name stack can contain at most 128 names, while the unique-color approach allows up to 2^(8+8+8) = 16777216 different colors, though sometimes there can be some approximation, so colors may get modified.
Which is the best strategy for my case?
Basically, you have 2 classes of options:
The "unique color way per triangle", which means you attach an id to every triangle, and render out the id's to a seperate render target. It can be 32 bit's (8 for rgb, 8 for a), but you could add a second one for even more ideas. It'll be fiddly to get the id's per triangle, but it's relatively easy to implement. Can be quite detrimental to performance though (fillrate).
Proper ray tracing. You almost certainly want to have an acceleration structure (octree, kd,...), but you probably already have one for frustum culling. One ray really isn't a lot, this method should be very fast.
Hybrid. probably the easiest to implement. Render out the vertex buffer id ("unique color per buffer:), and when you know which vertex buffer was selected", just trace a ray against all the triangles.
In the general case, I would say 2) is the best option. If you want to have something work quickly, go for 3). 1) is probably pretty useless.
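For option 1), packing the triangle id into an RGBA8 render target could look like the following sketch; reading the id back at the mouse position would then be a single-pixel glReadPixels:
// Fragment shader: pack gl_PrimitiveID into the four 8-bit channels.
out vec4 idColor;
void main()
{
    int id = gl_PrimitiveID;
    idColor = vec4(float( id        & 0xFF),
                   float((id >>  8) & 0xFF),
                   float((id >> 16) & 0xFF),
                   float((id >> 24) & 0xFF)) / 255.0;
}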
If your GPU supports OpenGL 4.2, you may use the imageStore() function in GLSL to mark triangle ids in an image. In my case, I need to detect all triangles behind a predefined window on the screen. Picking (choosing rendered triangles in a window) works similarly. The selection runs in real time for me.
The maximum size of an image (or texture) should be >= 8192x8192 = 64M, so it can be used for up to 64M primitives (and even more if we use 2 or 3 images).
Saving the ids of all triangles behind the window could be done with this fragment shader:
writeonly uniform uimage2D id_image;
out vec4 color_f;
void main()
{
    color_f = vec4(0);
    ivec2 p;
    p.x = gl_PrimitiveID % 2048;
    p.y = gl_PrimitiveID / 2048;
    imageStore(id_image, p, uvec4(255));
}
To save the ids of all triangles actually rendered (visible) on the screen: first, we precompute a depth buffer, then use a slightly different fragment shader:
writeonly uniform uimage2D id_image;
layout(early_fragment_tests) in; // the shader does not run for fragments that fail the depth test
out vec4 color_f;
void main()
{
    color_f = vec4(0);
    ivec2 p;
    p.x = gl_PrimitiveID % 2048;
    p.y = gl_PrimitiveID / 2048;
    imageStore(id_image, p, uvec4(255));
}