I'm trying to implement forward+ rendering using compute shaders in GLSL 4.6, but I don't know how to synchronize threads within a work group when some of them map to off-screen pixels. For example, my window resolution is 1600x900 and I'm using a work group size of 16x16, where each thread (invocation) corresponds to a single pixel on the screen. This means that size_x = 1600/16 = 100 and size_y = 900/16 = 56.25, which has to be rounded up to 57, so I need to call
glDispatchCompute(100, 57, 1);
As you can see, some threads in a work group may represent pixels that lie beyond the screen, so I want to return early or discard these off-screen pixels to skip the complex computation. However, my compute shader also calls barrier() in several places to synchronize local threads, and I don't know how to combine the two. The documentation says
For any given static instance of barrier in a compute shader, all invocations within a single work group must enter it before any are allowed to continue beyond it.
......
Barriers are also disallowed after a return statement
The only workaround I can think of is to fake computations for these threads, or use if-else to let them finish early in an intermediate stage between two barrier() calls. I guess this will introduce a little performance penalty. So, is there a better way to rule out invalid threads in a work group? I believe this problem is quite common for compute shaders so there might be an idiomatic way of handling it.
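The dispatch arithmetic from the question can be sketched on the host side as below. This is a minimal illustration, not part of the original post: group_count and padded_invocations are hypothetical helper names, and the code only computes the numbers that would be passed to glDispatchCompute and how many invocations end up off-screen.

```cpp
#include <cassert>
#include <cstdint>

// Number of work groups needed to cover `extent` pixels with groups of
// `local_size` invocations per axis: an integer ceiling division.
// (Hypothetical helper, not an OpenGL function.)
inline std::uint32_t group_count(std::uint32_t extent, std::uint32_t local_size) {
    assert(local_size > 0);
    return (extent + local_size - 1) / local_size;
}

// Number of launched invocations that fall outside a width x height image
// when both axes use the same local size. (Hypothetical helper.)
inline std::uint32_t padded_invocations(std::uint32_t width,
                                        std::uint32_t height,
                                        std::uint32_t local_size) {
    const std::uint32_t launched_x = group_count(width, local_size) * local_size;
    const std::uint32_t launched_y = group_count(height, local_size) * local_size;
    return launched_x * launched_y - width * height;
}
```

For 1600x900 with 16x16 groups this gives 100 and 57 groups, and 19200 off-screen invocations, all of them in the last row of work groups (12 of its 16 pixel rows lie below the screen).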
I'm trying to perform multiple asynchronous 2D convolutions on a single image with multiple filters using NVIDIA's NPP library function nppiFilterBorder_32f_C1R_Ctx. However, even after creating multiple streams and passing them to the NPP function, the operations do not overlap; NVIDIA's nvvp profiler shows the same.
That said, I'm unsure whether NPP supports overlapping operations across stream contexts.
Below is a simplification of my code, only showing the async method calls and related variables:
std::vector<NppStreamContext> streams(n_filters);
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
    cudaStreamCreateWithFlags(&(streams[stream_idx].hStream), cudaStreamNonBlocking);
    streams[stream_idx].nStreamFlags = cudaStreamNonBlocking;
    // fill up NppStreamContext remaining fields
    // malloc image and filter pointers
}
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
    cudaMemcpyAsync(..., streams[stream_idx].hStream);
    nppiFilterBorder_32f_C1R_Ctx(..., streams[stream_idx]);
    cudaMemcpy2DAsync(..., streams[stream_idx].hStream);
}
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
    cudaStreamSynchronize(streams[stream_idx].hStream);
    cudaStreamDestroy(streams[stream_idx].hStream);
}
Note: All the device pointers of the output images and input filters are stored in a std::vector, where I access them via the current stream index (e.g., float *ptr_filter_d = filters[stream_idx])
To summarize and add to the comments:
The profile does show small overlaps, so the answer to the title question is clearly yes.
The reason for the overlap being so small is just that each NPP kernel already needs all resources of the used GPU for most of its runtime. At the end of each kernel one can probably see the tail effect (i.e. the number of blocks is not a multiple of the number of blocks that can reside in SMs at each moment in time), so blocks from the next kernel are getting scheduled and there is some overlap.
It can sometimes be useful (i.e. an optimization) to force overlap between a big kernel which was started first and uses the full device and a later small kernel that only needs a few resources. In that case one can use stream priorities via cudaStreamCreateWithPriority to hint the scheduler to schedule blocks from the second kernel before blocks from the first kernel. An example of this can be found in this multi-GPU example (permalink).
In this case however, as the size of the kernels is the same and there is no reason to prioritize any of them over the others, forcing an overlap like this would not decrease the total runtime because the compute resources are limited. In the profiler view the kernels might then show more overlap but also each one would take more time. That is the reason why the scheduler does not overlap the kernels even though you allow it to do so by using multiple streams (See asynchronous vs. parallel).
To still increase performance, one could write a custom CUDA kernel that applies all the filters in one kernel launch. The main reason this could be better than using NPP in this case is that all NPP kernels take the same input image. A single kernel could therefore significantly decrease the number of accesses to global memory by reading each tile of the input image only once (into shared memory, although L1 caching might suffice), then applying all the filters sequentially or in parallel (by splitting the thread block up into smaller units) and writing out the results.
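The "read the input once, apply every filter" idea can be sketched on the CPU as follows. This is only an illustration of the loop structure, not a CUDA kernel and not NPP's implementation. The Filter struct and apply_filters_single_pass are hypothetical names; the clamp-to-edge border handling and the unflipped (correlation-style) weights are assumptions, and NPP's own anchor and border conventions are not reproduced.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical square filter with an odd side length, stored row-major.
struct Filter {
    int size;                    // side length, e.g. 3 for a 3x3 filter
    std::vector<float> weights;  // size * size coefficients
};

// Applies every filter to `image` (row-major, width x height) in one sweep
// over the pixels, so each neighbourhood is visited once for all filters.
// Assumption: borders are handled by clamping coordinates to the image edge.
std::vector<std::vector<float>> apply_filters_single_pass(
        const std::vector<float>& image, int width, int height,
        const std::vector<Filter>& filters) {
    assert(width > 0 && height > 0);
    assert(image.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::vector<float>> outputs(
        filters.size(), std::vector<float>(image.size(), 0.0f));
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // In a GPU kernel this inner loop is where one tile held in
            // shared memory would be reused by all filters.
            for (std::size_t f = 0; f < filters.size(); ++f) {
                const Filter& filt = filters[f];
                const int r = filt.size / 2;
                float acc = 0.0f;
                for (int dy = -r; dy <= r; ++dy) {
                    for (int dx = -r; dx <= r; ++dx) {
                        const int sx = std::max(0, std::min(x + dx, width - 1));
                        const int sy = std::max(0, std::min(y + dy, height - 1));
                        acc += image[sy * width + sx] *
                               filt.weights[(dy + r) * filt.size + (dx + r)];
                    }
                }
                outputs[f][y * width + x] = acc;
            }
        }
    }
    return outputs;
}
```

Calling it with an identity 3x3 filter returns the input image unchanged in the corresponding output, which is a quick sanity check before porting the structure to a kernel.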
I'm facing a problem where the use of an occlusion query in combination with instanced rendering would be desirable.
As far as I understood, something like
glBeginQuery(GL_ANY_SAMPLES_PASSED, occlusionQuery);
glDrawArraysInstanced(mode, i, j, countInstances);
glEndQuery(GL_ANY_SAMPLES_PASSED);
will only tell me whether any of the instances were drawn.
What I need to know is which set of instances has been drawn (giving me the IDs of all visible instances). Drawing each instance in its own call is not an option for me.
An alternative would be to color-code the instances and detect the visible instances manually.
But is there really no way to solve this problem with a query command and why would it not be possible?
It's not possible for several reasons.
Query objects only contain a single counter value. What you want would require a separate sample passed count for each instance.
Even if query objects stored arrays of sample counts, you can issue more than one draw call in the begin/end scope of a query. So how would OpenGL know which part of which draw call belonged to which query value in the array? You can even change other state within the query scope; uniform bindings, programs, pretty much anything.
The samples-passed count is determined entirely by the rasterizer hardware on the GPU. And the rasterizer neither knows nor cares which instance generated a triangle.
Instancing is a function of the vertex processing and/or vertex specification stages; by the time the rasterizer sees it, that information is gone. Notice that fragment shaders don't even get an instance ID as input, unless you explicitly create one by passing it from your vertex processing stage(s).
However, if you truly want to do this you could use image load/store and its atomic operations. That is, pass the fragment shader the instance in question (as an int data type, with flat interpolation). This FS also uses a uimageBuffer buffer texture, which uses the GL_R32UI format (or you can use an SSBO unbounded array). It then performs an imageAtomicAdd, using the instance value passed in as the index to the buffer. Oh, and you'll need to have the FS explicitly require early tests, so that samples which fail the fragment tests will not execute.
Then use a compute shader to build up a list of rendering commands for the instances which have non-zero values in the array. Then use an indirect rendering call to draw the results of this computation. Now obviously, you will need to properly synchronize access between these various operations. So you'll need to use appropriate glMemoryBarrier calls between each one.
Even if queries worked the way you want them to, this would be overall far more preferable than using a query object. Unless you're reading a query into a buffer object, reading a query object requires a GPU/CPU synchronization of some form. Whereas the above requires some synchronization and barrier operations, but they're all on-GPU operations, rather than synchronizing with the CPU.
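The compaction step described above (turning the per-instance counters into a list of draw commands) would run in a compute shader, but its logic can be sketched on the CPU. The struct mirrors the four fields that glDrawArraysIndirect reads; build_visible_draws is a hypothetical helper, and emitting one command per visible instance with baseInstance set to the instance ID is just one possible layout, assumed here for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field layout of the command read by glDrawArraysIndirect.
struct DrawArraysIndirectCommand {
    std::uint32_t count;          // vertices per instance
    std::uint32_t instanceCount;  // instances drawn by this command
    std::uint32_t first;          // first vertex
    std::uint32_t baseInstance;   // offset applied to instanced attributes
};

// CPU stand-in for the compute pass: emit one command for every instance
// whose counter (filled by imageAtomicAdd in the fragment shader) is
// non-zero. (Hypothetical helper; names are not from the original answer.)
std::vector<DrawArraysIndirectCommand> build_visible_draws(
        const std::vector<std::uint32_t>& samples_per_instance,
        std::uint32_t first_vertex, std::uint32_t vertex_count) {
    assert(vertex_count > 0);
    std::vector<DrawArraysIndirectCommand> commands;
    for (std::uint32_t id = 0; id < samples_per_instance.size(); ++id) {
        if (samples_per_instance[id] != 0) {
            commands.push_back({vertex_count, 1u, first_vertex, id});
        }
    }
    return commands;
}
```

On the GPU the same loop would typically append to the command buffer through an atomic counter, with the glMemoryBarrier calls mentioned above placed between the counting pass, the compaction pass and the indirect draw.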
I'd like to enumerate those general, fundamental circumstances under which multi-pass rendering becomes an unavoidable necessity, as opposed to keeping everything within the same shader program. Here's what I've come up with so far.
When a result requires non-local fragment information (i.e. context) around the current fragment, e.g. for box filters, then a previous pass must have supplied this;
When a result needs hardware interpolation done by a prior pass;
When a result acts as pre-cache of some set of calculations that enables substantially better performance than simply (re-)working through the entire set of calculations in those passes that use them, e.g. transforming each fragment of the depth buffer in a particular and costly way, which multiple later-pass shaders can then share, rather than each repeating those calculations. So, calculate once, use more than once.
I note from my own (naive) deductions above that vertex and geometry shaders don't really seem to come into the picture of deferred rendering, and so are probably usually done in first pass; to me this seems sensible, but either affirmation or negation of this, with detail, would be of interest.
P.S. I am going to leave this question open to gather good answers, so don't expect quick wins!
Nice topic. I'm a beginner, but I would say one reason is to avoid the unnecessary pixel/fragment shader calculations that you get with forward rendering.
With forward rendering you have to do a pass for every light you have in your scene, even if the pixel colors aren't affected.
But that's just a comparison between forward rendering and deferred rendering.
As opposed to keeping everything in the same shader program, the simplest thing I can think of is that you aren't restricted to a fixed number N of lights in your scene, whereas a single GLSL shader, for instance, has to declare its lights either separately or as a uniform array. Then again, you can also use forward rendering, but with a lot of lights in the scene its pixel/fragment shader becomes too expensive.
That's all I really know so I would like to hear other theories as well.
Deferred / multi-pass approaches are used when the results of the depth buffer are needed (produced by rendering basic geometry) in order to produce complex pixel / fragment shading effects based on depth, such as:
Edge / silhouette detection
Lighting
And also application logic:
GPU picking, which requires the depth buffer for ray calculation, and uniquely-coloured / ID'ed geometries in another buffer for identification of "who" was hit.
To make a long story short, am I better off doing this:
if (normalMappingEnabled)
{
    normal = calculateBumpedNormalFromTexture();
}
else
{
    normal = somethingMuchEasierToCalculate();
}
Where normalMappingEnabled is a uniform, calculateBumpedNormalFromTexture requires a texture lookup and all of the other math required for normal mapping, and somethingMuchEasierToCalculate requires no texture lookup, and simply spits out the interpolated per-vertex normal.
or this:
normal = calculateBumpedNormalFromTexture();
Where in this case, if I don't need normal mapping, the normal texture is 1x1, containing a single texel that points straight up, producing the same result as if I had just used an interpolated per-vertex normal.
Which is faster on most modern hardware? Is there another solution I haven't considered (other than using 2 different shaders)?
If the condition is the same for all fragment shader invocations, as it is for a uniform, then no branch divergence will happen. So AFAIK there will be no performance loss in this case.
The problem is when some threads need to execute one branch and some threads the other branch. Since different instructions can't be executed in parallel (on one processor), both branches would be executed sequentially (some threads would get executed, the other part would wait and then the other part would get executed and the first part would wait).
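The serialization described above can be illustrated with a toy cost model. This is a deliberate simplification, not a description of any particular GPU: lockstep_cost is a hypothetical function, and the per-branch costs are arbitrary units chosen by the caller.

```cpp
#include <cassert>
#include <vector>

// Toy model: a group of lanes that run in lock-step pays for each branch
// that at least one lane takes. If every lane agrees (e.g. the condition is
// a uniform), only one branch is paid for; if they diverge, both are.
int lockstep_cost(const std::vector<bool>& takes_if, int cost_if, int cost_else) {
    assert(!takes_if.empty());
    bool any_if = false;
    bool any_else = false;
    for (bool taken : takes_if) {
        if (taken) {
            any_if = true;
        } else {
            any_else = true;
        }
    }
    return (any_if ? cost_if : 0) + (any_else ? cost_else : 0);
}
```

With costs 10 and 2, a group where every lane takes the expensive branch costs 10, a group where every lane takes the cheap one costs 2, and a group with a single dissenting lane costs 12.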
The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.
In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.
Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces the following result:
which is the same result as having no synchronization or using memoryBarrier() instead.
If I use barrier(), I get the following (desired) result:
The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.
What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?
#version 430
#define SIZE 64
layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;
layout(rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;
shared vec4 shared_data[SIZE];
void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x,0);
    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i,0));
        }
    }
    // with no synchronization: stripes
    // memoryBarrier(); // stripes
    // memoryBarrierShared(); // stripes
    // barrier(); // works
    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}
The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes the data of its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which don't have a dedicated output and only produce results by writing into writable storage such as images, storage buffers or atomic counters. This may require manual synchronization between individual passes; otherwise a fragment shader that accesses a texture might not see the most recent data written into that texture with image store operations by a preceding pass, like your compute shader.
So it may be that your compute shader works perfectly, but it is the synchronization with the following display (or whatever) pass (that needs to read this image data somehow) that fails. For this purpose there exists the glMemoryBarrier function. Depending on how you read that image data in the display pass (or more precisely the pass that reads the image after the compute shader pass), you need to give a different flag to this function. If you read it using a texture, use GL_TEXTURE_FETCH_BARRIER_BIT, if you use an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT, if using glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...
That said, I don't have much experience with image load/store and manual memory synchronization, and this is only what I came up with theoretically. So if anyone knows better, or you already use a proper glMemoryBarrier, feel free to correct me. Likewise, this need not be your only error (if any). But the last two points from the linked Wiki article actually address your use case and IMHO make it clear that you need some kind of glMemoryBarrier:
Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or
memoryBarrier(). Calling glMemoryBarrier with the
SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is
necessary.
Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need
not use coherent variables or memoryBarrier(). Calling
glMemoryBarrier with the appropriate bits set in barriers between
passes is necessary.
EDIT: Actually the Wiki article on compute shaders says
Shared variable access uses the rules for incoherent memory access.
This means that the user must perform certain synchronization in order
to ensure that shared variables are visible.
Shared variables are all implicitly declared coherent, so you don't
need to (and can't use) that qualifier. However, you still need to
provide an appropriate memory barrier.
The usual set of memory barriers is available to compute shaders, but
they also have access to memoryBarrierShared(); this barrier is
specifically for shared variable ordering. groupMemoryBarrier()
acts like memoryBarrier(), ordering memory writes for all kinds of
variables, but it only orders read/writes for the current work group.
While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are
executing in lock-step. If you need to ensure that an invocation has
written to some variable so that you can read it, you need to
synchronize execution with the invocations, not just issue a memory
barrier (you still need the memory barrier though).
To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an
explicit synchronization between all invocations in the work group.
Execution within the work group will not proceed until all other
invocations have reach this barrier. Once past the barrier(), all
shared variables previously written across all invocations in the
group will be visible.
So it actually sounds like you need barrier() there and memoryBarrierShared() alone is not enough (though you don't need both, as the last sentence says). The memory barrier only orders the memory accesses; it doesn't stop threads from executing past it. Thus the threads won't read stale cached data from shared memory once the first thread has written something, but they can very well reach the read before the first thread has written anything at all.
This actually fits perfectly with the fact that it works for block sizes of 32 and below and that the first 32 pixels are correct. At least on NVIDIA hardware, 32 is the warp size and thus the number of threads that operate in lock-step. So the first 32 threads (well, every block of 32 threads) always run exactly in parallel (conceptually, that is) and thus cannot race with each other. This is also why you may not need any synchronization if you know you are working inside a single warp, a common optimization, although it relies on hardware behavior that the GLSL specification does not guarantee.
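The striping can be reproduced with a small deterministic model of the shader from the question. This is a toy simulation under loudly stated assumptions, not how a GPU is actually scheduled: invocations are assumed to run in lock-step subgroups of `subgroup` lanes, and without barrier() the model picks one legal worst case in which the last subgroup runs first. simulate_copy is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Toy model of the copy shader. Returns how many of the `size` outputs
// match their input.
// Assumptions: lanes execute in lock-step subgroups of `subgroup` lanes;
// without a barrier, subgroups are run in reverse order (worst case).
int simulate_copy(int size, int subgroup, bool use_barrier) {
    assert(size > 0 && subgroup > 0);
    std::vector<int> input(size), shared(size, -1), output(size, -1);
    for (int i = 0; i < size; ++i) input[i] = i + 1;

    const int groups = (size + subgroup - 1) / subgroup;
    // Work done by invocation 0: fill the whole shared array.
    auto fill = [&] { shared = input; };
    // Work done by every lane of subgroup g: copy its shared slot out.
    auto read = [&](int g) {
        for (int lane = g * subgroup; lane < size && lane < (g + 1) * subgroup; ++lane)
            output[lane] = shared[lane];
    };

    if (use_barrier) {
        fill();  // every invocation waits until invocation 0 is done
        for (int g = 0; g < groups; ++g) read(g);
    } else {
        for (int g = groups - 1; g >= 0; --g) {
            if (g == 0) fill();  // subgroup 0 fills, then reads in lock-step
            read(g);
        }
    }

    int correct = 0;
    for (int i = 0; i < size; ++i) correct += (output[i] == input[i]);
    return correct;
}
```

With size 64 and subgroup 32, the model yields 32 correct pixels without the barrier and 64 with it, while size 32 is fully correct either way, which matches the 32-pixel stripes described in the question.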