Let's say I have an OpenGL compute shader with local_size = 8*8*8. How do the invocations map to NVIDIA GPU warps? Would invocations with the same gl_LocalInvocationID.x be in the same warp? Or the same y? Or z? I don't mean the exact mapping of every invocation, just the general grouping.
I am asking because of an optimization: at one point not all invocations have work to do, and I want the idle ones to end up in the same warps.
The compute shader execution model allows the number of invocations to (greatly) exceed the number of individual execution units in a warp/wavefront. For example, hardware warp/wavefront sizes tend to be between 16 and 64, while the maximum number of invocations within a work group (GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS) is required by OpenGL to be no less than 1024.
When a work group spans multiple warps/wavefronts, barrier calls and shared variable usage work essentially by halting the progress of all warps/wavefronts until each of them has passed that particular point, and then performing various memory flushes so that they can access each other's variables (based on memory barrier usage, of course). If all of the invocations in a work group fit into a single warp, then it's possible to avoid such things.
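To make that pattern concrete, here is a minimal sketch of the shared-variable-plus-barrier idiom being referred to (the array name and sizes are made up for illustration):

    #version 430
    layout(local_size_x = 64) in;

    layout(std430, binding = 0) buffer Data { float values[]; };

    shared float neighborValue[64];

    void main() {
        uint i = gl_LocalInvocationIndex;

        // Each invocation publishes one value to the group...
        neighborValue[i] = values[gl_GlobalInvocationID.x];

        // ...and must wait for the whole group (possibly several warps)
        // before reading what another invocation wrote.
        barrier();

        float right = neighborValue[(i + 1u) % 64u];
        values[gl_GlobalInvocationID.x] = 0.5 * (neighborValue[i] + right);
    }

If the whole group fits in one warp, the implementation may effectively have nothing to stall here, which is the point made above.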
Basically, you have no control over how CS invocations are grouped into warps. You can assume that the implementation is not trying to be slow (that is, it will generally group invocations from the same work group into the same warp), but you cannot assume that all invocations within the same work group will be in the same warp.
Nor should you assume that each warp only executes invocations from the same work group.
According to this: https://www.khronos.org/opengl/wiki/Compute_Shader#Inputs
gl_LocalInvocationIndex =
    gl_LocalInvocationID.z * gl_WorkGroupSize.x * gl_WorkGroupSize.y +
    gl_LocalInvocationID.y * gl_WorkGroupSize.x +
    gl_LocalInvocationID.x;
So, to the extent you can assume anything, it is that invocations with consecutive gl_LocalInvocationIndex values get packed into the same warp. For an 8*8*8 group that means invocations sharing the same gl_LocalInvocationID.z and y, differing only in x, are the ones likely to end up together; invocations that merely share the same x are spaced 8 indices apart and will span several warps.
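If the goal is to let whole warps skip work, a hedged sketch of that idea follows. It assumes (without any guarantee from the GL API) that invocations are packed into warps of 32 in gl_LocalInvocationIndex order, and warpHasWork is a made-up name for whatever per-warp condition you have:

    #version 430
    layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;

    // Hypothetical per-warp flag: one entry per presumed warp of 32 invocations
    // (assuming a 1D dispatch here for simplicity).
    layout(std430, binding = 0) buffer WarpWork { uint warpHasWork[]; };

    void main() {
        // 512 invocations per group / 32 = 16 presumed warps per group.
        uint warpId = gl_WorkGroupID.x * 16u + gl_LocalInvocationIndex / 32u;

        // Every invocation in a presumed warp reads the same flag, so either the
        // whole warp does the heavy work or the whole warp skips it (no divergence).
        if (warpHasWork[warpId] == 0u)
            return;                      // fine here: no barrier() occurs after this point

        // ... expensive per-invocation work ...
    }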
I'm trying to perform multiple async 2D convolutions on a single image with multiple filters using NVIDIA's NPP library method nppiFilterBorder_32f_C1R_Ctx. However, even after creating multiple streams and assigning them to NPP's method, the overlapping isn't happening; NVIDIA's nvvp shows the same:
That said, I'm not sure whether NPP supports overlapping context operations at all.
Below is a simplification of my code, only showing the async method calls and related variables:
std::vector<NppStreamContext> streams(n_filters);
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
    cudaStreamCreateWithFlags(&(streams[stream_idx].hStream), cudaStreamNonBlocking);
    streams[stream_idx].nStreamFlags = cudaStreamNonBlocking;
    // fill up NppStreamContext remaining fields
    // malloc image and filter pointers
}
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
    cudaMemcpyAsync(..., streams[stream_idx].hStream);
    nppiFilterBorder_32f_C1R_Ctx(..., streams[stream_idx]);
    cudaMemcpy2DAsync(..., streams[stream_idx].hStream);
}
for(size_t stream_idx=0; stream_idx<n_filters; stream_idx++)
{
    cudaStreamSynchronize(streams[stream_idx].hStream);
    cudaStreamDestroy(streams[stream_idx].hStream);
}
Note: All the device pointers of the output images and input filters are stored in a std::vector, where I access them via the current stream index (e.g., float *ptr_filter_d = filters[stream_idx])
To summarize and add to the comments:
The profiler output does show small overlaps, so the answer to the title question is clearly yes.
The reason the overlap is so small is simply that each NPP kernel already needs all the resources of the GPU for most of its runtime. At the end of each kernel one can probably see the tail effect (i.e. the number of blocks is not a multiple of the number of blocks that can reside on the SMs at any moment in time), so blocks from the next kernel get scheduled and there is some overlap.
It can sometimes be useful (i.e. an optimization) to force overlap between a big kernel that was started first and uses the full device and a later small kernel that needs only few resources. In that case one can use stream priorities via cudaStreamCreateWithPriority to hint the scheduler to schedule blocks from the second kernel before blocks from the first kernel. An example of this can be found in this multi-GPU example (permalink).
In this case, however, as the kernels are all the same size and there is no reason to prioritize any of them over the others, forcing an overlap like this would not decrease the total runtime, because the compute resources are limited. In the profiler view the kernels might then show more overlap, but each one would also take more time. That is the reason why the scheduler does not overlap the kernels even though you allow it to do so by using multiple streams (see asynchronous vs. parallel).
To still increase performance, one could write a custom CUDA kernel that does all the filters in one kernel launch. The main reason this could be better than using NPP in this case is that all NPP kernels take the same input image. Therefore a single kernel could significantly decrease the number of accesses to global memory by reading in each tile of the input image only once (into shared memory, although L1 caching might suffice), then applying all the filters sequentially or in parallel (by splitting the thread block up into smaller units), and writing out the results.
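As a rough sketch of that single-pass idea (written here as a GLSL compute shader rather than CUDA; the tile size, filter size, bindings and the Filters buffer are all illustrative assumptions, not NPP's API):

    #version 430
    #define TILE 16
    #define HALO 2                 // assumes 5x5 filters; adjust for the real kernel size
    #define N_FILTERS 4            // hypothetical number of filters

    layout(local_size_x = TILE, local_size_y = TILE) in;
    layout(r32f, binding = 0) uniform readonly  image2D inImage;
    layout(r32f, binding = 1) uniform writeonly image2DArray outImages;  // one layer per filter
    layout(std430, binding = 2) buffer Filters { float weights[]; };     // N_FILTERS * 5 * 5

    shared float tile[TILE + 2*HALO][TILE + 2*HALO];

    void main() {
        ivec2 tileOrigin = ivec2(gl_WorkGroupID.xy) * TILE - HALO;

        // Cooperatively load the tile plus halo: the input is read from global memory only once.
        for (int i = int(gl_LocalInvocationIndex); i < (TILE+2*HALO)*(TILE+2*HALO); i += TILE*TILE) {
            ivec2 p = clamp(tileOrigin + ivec2(i % (TILE+2*HALO), i / (TILE+2*HALO)),
                            ivec2(0), imageSize(inImage) - 1);
            tile[i / (TILE+2*HALO)][i % (TILE+2*HALO)] = imageLoad(inImage, p).r;
        }
        barrier();   // make the shared tile visible to every invocation

        ivec2 local = ivec2(gl_LocalInvocationID.xy);
        // Apply every filter to the same shared tile.
        for (int f = 0; f < N_FILTERS; ++f) {
            float acc = 0.0;
            for (int dy = -HALO; dy <= HALO; ++dy)
                for (int dx = -HALO; dx <= HALO; ++dx)
                    acc += tile[local.y + HALO + dy][local.x + HALO + dx]
                         * weights[(f*(2*HALO+1) + dy+HALO)*(2*HALO+1) + dx+HALO];
            imageStore(outImages, ivec3(tileOrigin + HALO + local, f), vec4(acc));
        }
    }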
I have searched and read some wiki pages about incoherent memory access, such as https://en.wikipedia.org/wiki/Memory_coherence and https://www.khronos.org/opengl/wiki/Memory_Model
To my knowledge, the main cause of incoherent memory access is multiple local caches of the same memory address. In a cached memory architecture, the processor usually does not access memory directly but goes through the cache. In a multi-processor system, each processor has its own cache but shares one memory, which may result in multiple copies of the same memory address. Therefore, one processor may read stale data from a memory address even after another processor has written to that address.
However, a shared variable should be located in a cache and can only be accessed by invocations within the same workgroup, which execute on the same processor. Thus there should not be multiple versions of a shared variable. Even if the shared variables exceed the maximum size of the cache and part of the data spills to memory, each shared variable still exists in only one cache. Why is access to shared variables incoherent?
Also, is access to coherent image/buffer variables by invocations within a workgroup incoherent?
Supplement
To my understanding, there are two kinds of barriers in compute shaders.
The barrier function of GLSL, which is used to control the execution of the shader code and makes sure that a preceding write operation has truly happened before a later read operation.
Memory barriers, which are used to make the data of a write operation visible to future read operations. These memory barriers exist for incoherent memory access. In my understanding, incoherent memory access means that values written by one shader invocation are not necessarily visible to another invocation even when the read happens after the write; memory barriers are used to handle this situation.
What I really want to ask is this: for invocations of one workgroup, can the incoherent memory access I just described happen for shared, buffer, or image variables? In other words, for invocations of one workgroup, if I use barrier to make sure the read happens after the write, is the written value visible to the read?
To my way of thinking, the written value is always visible to a later read in the case above, because one workgroup executes on only one compute unit, so there are not multiple caches for the same memory address. But I am not sure about that.
Workgroup size limits are often many times larger than the wavefront/warp size of an actual execution unit. The point of collecting invocations into work groups is to be able to have them share information and have execution barriers between them.
If a workgroup is larger than an invocation subgroup (or if there is execution divergence due to divergent conditional execution), shared memory still needs to work. If the wavefront size is 32, but you have a workgroup size of 128, how can invocation number 97 read data written by invocation 2 when they're in different wavefronts?
An implementation could execute them sequentially on the same compute unit. Execute the first 32 invocations, then the next, etc. This would keep all shared memory in local storage, but how do you read data written by another invocation? You need an execution barrier so that the implementation knows to stop executing the current 32 and move to the next wavefront within the workgroup. Memory accesses are incoherent because you cannot know which invocations have even executed writes without an explicit barrier.
Any means of executing such work groups other than serially on a single compute unit would mean that shared memory has to be shared between different compute units. And that means caches.
I'm trying to implement forward+ rendering using compute shaders in GLSL 4.6, but I don't know how to synchronize threads within a work group when dealing with off-screen pixels. For example, my window resolution is 1600x900 and I'm using a work group size of 16x16, where each thread or invocation corresponds to a single pixel on the screen. This means that size_x = 1600/16 = 100 and size_y = 900/16 = 56.25, so I need to call
glDispatchCompute(100, 57, 1);
As you can see, some threads in a work group may represent pixels that extend beyond the screen, so I want to return early or discard these off-screen pixels to skip the complex computation. However, my compute shader also contains barrier() calls in several places in order to synchronize local threads, and I don't know how to combine the two. The documentation says
For any given static instance of barrier in a compute shader, all invocations within a single work group must enter it before any are allowed to continue beyond it.
......
Barriers are also disallowed after a return statement
The only workaround I can think of is to fake the computations for these threads, or to use if-else so that they finish early in an intermediate stage between two barrier() calls. I guess this introduces a small performance penalty. So, is there a better way to rule out invalid threads in a work group? I believe this problem is quite common for compute shaders, so there might be an idiomatic way of handling it.
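For reference, the if-else workaround described above usually looks something like the sketch below: keep all invocations alive, guard the per-pixel work, and leave every barrier() in uniform control flow. The 16x16 group size matches the question; screenSize and the culling/shading steps are placeholders:

    #version 460
    layout(local_size_x = 16, local_size_y = 16) in;

    uniform ivec2 screenSize;    // e.g. (1600, 900)

    shared uint tileLightCount;  // placeholder for whatever shared state the tile pass uses

    void main() {
        ivec2 pixel    = ivec2(gl_GlobalInvocationID.xy);
        bool  onScreen = all(lessThan(pixel, screenSize));

        if (gl_LocalInvocationIndex == 0u)
            tileLightCount = 0u;
        barrier();                       // reached by every invocation, on-screen or not

        // ... light culling: either let all invocations participate, or guard with if (onScreen) ...

        barrier();                       // still legal: no invocation has returned

        if (onScreen) {
            // only valid pixels do the expensive shading and the final image store
        }
    }

Skipping work this way costs little, because the extra invocations in the last, partially filled row of work groups sit idle in branches instead of doing the heavy shading.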
My question concerns compute shaders, HLSL code in particular. DeviceContext.Dispatch(X, Y, Z) spawns X * Y * Z groups, each of which has x * y * z individual threads, as set in the attribute [numthreads(x,y,z)]. The question is: how can I get the total number of thread groups dispatched and the number of threads in a group? Let me explain why I want it: the amount of data I intend to process may vary significantly, so my methods should adapt to the size of the input arrays. Of course I can send the Dispatch arguments in a constant buffer to make them available from HLSL code, but what about the number of threads in a group? I am looking for methods like GetThreadGroupNumber() and GetThreadNumberInGroup(). I appreciate any help.
The number of threads in a group is simply the product of the numthreads dimensions. For example, numthreads(32,8,4) will have 32*8*4 = 1024 threads per group. This can be determined statically at compile time.
The ID for a particular thread-group can be determined by adding a uint3 input argument with the SV_GroupId semantic.
The ID for a particular thread within a thread-group can be determined by adding a uint3 input argument with the SV_GroupThreadID semantic, or uint SV_GroupIndex if you prefer a flattened version.
As far as providing information to each thread on the total size of the dispatch, using a constant buffer is your best bet. This is analogous to the graphics pipeline, where the pixel shader doesn't naturally know the viewport dimensions.
It's also worth mentioning that if you do find yourself in a position where each thread needs to know the overall dispatch size, you should consider restructuring your algorithm. In general, it's better to dispatch a variable number of thread groups, each with a fixed amount of work, than to dispatch a fixed number of threads with a variable amount of work. There are of course exceptions, but this will tend to provide better utilization of the hardware.
The following GLSL compute shader simply copies inImage to outImage. It is derived from a more complex post-processing pass.
In the first several lines of main(), a single thread loads 64 pixels of data into the shared array. Then, after synchronizing, each of the 64 threads writes one pixel to the output image.
Depending on how I synchronize, I get different results. I originally thought memoryBarrierShared() would be the correct call, but it produces the following result:
which is the same result as having no synchronization or using memoryBarrier() instead.
If I use barrier(), I get the following (desired) result:
The striping is 32 pixels wide, and if I change the workgroup size to anything less than or equal to 32, I get correct results.
What's going on here? Am I misunderstanding the purpose of memoryBarrierShared()? Why should barrier() work?
#version 430
#define SIZE 64
layout (local_size_x = SIZE, local_size_y = 1, local_size_z = 1) in;
layout(rgba32f) uniform readonly image2D inImage;
uniform writeonly image2D outImage;
shared vec4 shared_data[SIZE];
void main() {
    ivec2 base = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);
    ivec2 my_index = base + ivec2(gl_LocalInvocationID.x, 0);

    if (gl_LocalInvocationID.x == 0) {
        for (int i = 0; i < SIZE; i++) {
            shared_data[i] = imageLoad(inImage, base + ivec2(i, 0));
        }
    }

    // with no synchronization: stripes
    // memoryBarrier(); // stripes
    // memoryBarrierShared(); // stripes
    // barrier(); // works

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);
}
The problem with image load/store and friends is that the implementation can no longer be sure that a shader only changes the data of its dedicated output values (e.g. the framebuffer after a fragment shader). This applies even more to compute shaders, which don't have a dedicated output but only output things by writing data into writable stores, like images, storage buffers or atomic counters. This may require manual synchronization between individual passes, as otherwise a fragment shader trying to access a texture might not see the most recent data written into that texture with image store operations by a preceding pass, like your compute shader.
So it may be that your compute shader works perfectly, but it is the synchronization with the following display (or whatever) pass (that needs to read this image data somehow) that fails. For this purpose there exists the glMemoryBarrier function. Depending on how you read that image data in the display pass (or, more precisely, the pass that reads the image after the compute shader pass), you need to give a different flag to this function. If you read it using a texture, use GL_TEXTURE_FETCH_BARRIER_BIT; if you use an image load again, use GL_SHADER_IMAGE_ACCESS_BARRIER_BIT; if you use glBlitFramebuffer for display, use GL_FRAMEBUFFER_BARRIER_BIT...
Though I don't have much experience with image load/store and manual memory synchronization, and this is only what I came up with theoretically. So if anyone knows better, or you already use a proper glMemoryBarrier, then feel free to correct me. Likewise, this need not be your only error (if any). But the last two points from the linked Wiki article actually address your use case and IMHO make it clear that you need some kind of glMemoryBarrier:
Data written to image variables in one rendering pass and read by the shader in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the SHADER_IMAGE_ACCESS_BARRIER_BIT set in barriers between passes is necessary.
Data written by the shader in one rendering pass and read by another mechanism (e.g., vertex or index buffer pulling) in a later pass need not use coherent variables or memoryBarrier(). Calling glMemoryBarrier with the appropriate bits set in barriers between passes is necessary.
EDIT: Actually the Wiki article on compute shaders says
Shared variable access uses the rules for incoherent memory access. This means that the user must perform certain synchronization in order to ensure that shared variables are visible.
Shared variables are all implicitly declared coherent, so you don't need to (and can't use) that qualifier. However, you still need to provide an appropriate memory barrier.
The usual set of memory barriers is available to compute shaders, but they also have access to memoryBarrierShared(); this barrier is specifically for shared variable ordering. groupMemoryBarrier() acts like memoryBarrier(), ordering memory writes for all kinds of variables, but it only orders read/writes for the current work group.
While all invocations within a work group are said to execute "in parallel", that doesn't mean that you can assume that all of them are executing in lock-step. If you need to ensure that an invocation has written to some variable so that you can read it, you need to synchronize execution with the invocations, not just issue a memory barrier (you still need the memory barrier though).
To synchronize reads and writes between invocations within a work group, you must employ the barrier() function. This forces an explicit synchronization between all invocations in the work group. Execution within the work group will not proceed until all other invocations have reached this barrier. Once past the barrier(), all shared variables previously written across all invocations in the group will be visible.
So this actually sounds like you need the barrier there, and memoryBarrierShared alone is not enough (though you don't need both, as the last sentence says). The memory barrier only synchronizes memory; it does not stop the threads' execution at that point. Thus the threads won't read any old cached data from shared memory if the first thread has already written something, but they can very well reach the reading point before the first thread has tried to write anything at all.
This fits perfectly with the fact that it works for block sizes of 32 and below, and that the first 32 pixels are correct. On NVIDIA hardware at least, 32 is the warp size and thus the number of threads that operate in perfect lock-step. So every block of 32 threads always works exactly in parallel (well, conceptually), and thus they cannot introduce any race conditions. This is also why you don't actually need any synchronization if you know you are working inside a single warp, a common optimization.
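Concretely, for the posted shader that means the fix is exactly the commented-out line: put an execution barrier between the single-invocation load and the per-invocation store.

    // executed only by gl_LocalInvocationID.x == 0: fill shared_data[0..SIZE-1]

    barrier();  // execution sync; past this point the shared_data writes are visible to the whole group

    imageStore(outImage, my_index, shared_data[gl_LocalInvocationID.x]);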