I want to understand how to work with compute shaders, but I couldn't find many details on the Internet. What is a workgroup?
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
What does this mean?
vkCmdDispatch(cmdBuffer, 1, 1, 1);
Should the values in the Shader and in the function be the same?
For understanding these basic concepts of compute shaders, material on OpenCL, OpenGL, Metal, D3D, and CUDA compute is also relevant: they all use a similar hierarchical grid subdivision of work.
The hierarchy, from finest to coarsest, in Vulkan terms is: invocation (aka thread) > subgroup > local workgroup > global workgroup (aka dispatch). Subgroups are a more advanced topic; you can ignore them for now as they're mostly implicit. Just to be confusing, people often just say "workgroup" when they mean "local workgroup".
The layout(local_size) declaration in your shader defines the dimensions of a local workgroup in terms of individual invocations. The parameters to vkCmdDispatch give the dimensions of the global workgroup, in terms of local workgroups.
So if you call vkCmdDispatch(cmdbuf, M, N, P) and the compute shader in the current pipeline declared layout (local_size_x=X, local_size_y=Y, local_size_z=Z), then Vulkan will run MxNxP local workgroups, each of which consists of XxYxZ invocations of your shader.
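For instance, a common pattern is to fix the local size in the shader and compute the dispatch counts on the host by rounding up. A minimal sketch, with hypothetical image dimensions and the cmdBuffer from your snippet:
// Hypothetical host-side sizing: cover a 1920x1080 image with a shader that
// declares layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;
const uint32_t width = 1920, height = 1080;
const uint32_t localSizeX = 16, localSizeY = 16;

// Round up so the whole image is covered even if it isn't a multiple of 16.
uint32_t groupCountX = (width  + localSizeX - 1) / localSizeX;  // 120
uint32_t groupCountY = (height + localSizeY - 1) / localSizeY;  // 68 (covers 1088 rows; bounds-check in the shader)

vkCmdDispatch(cmdBuffer, groupCountX, groupCountY, 1);
// Total invocations: (120 * 68) workgroups * (16 * 16) invocations each.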
Within each invocation you can find out where it is within the local and global grids with the GLSL built-in input variables gl_NumWorkGroups, gl_WorkGroupID, gl_LocalInvocationID, gl_GlobalInvocationID, and gl_LocalInvocationIndex.
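If it helps to see the relationships spelled out, here is a CPU-side sketch of how those built-ins relate to each other per the GLSL spec; the values are hypothetical and the variable names just mirror the built-ins:
// CPU-side sketch of the GLSL compute built-in relationships.
struct UVec3 { unsigned x, y, z; };

UVec3 workGroupSize     = {8, 8, 1};   // matches layout(local_size_*) in the shader
UVec3 workGroupID       = {3, 5, 0};   // which local workgroup this invocation is in
UVec3 localInvocationID = {2, 7, 0};   // position inside that workgroup

UVec3 globalInvocationID = {
    workGroupID.x * workGroupSize.x + localInvocationID.x,   // 26
    workGroupID.y * workGroupSize.y + localInvocationID.y,   // 47
    workGroupID.z * workGroupSize.z + localInvocationID.z }; // 0

unsigned localInvocationIndex =
    localInvocationID.z * workGroupSize.x * workGroupSize.y +
    localInvocationID.y * workGroupSize.x +
    localInvocationID.x;                                      // 58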
After watching videos and reading the documentation on DXR and DX12, I'm still not sure how to manage resources for DX12 raytracing (DXR).
There is quite a difference between rasterization and raytracing in terms of resource management. The main difference is that rasterization uses a lot of transient resources that can be bound on the fly, whereas raytracing needs all resources ready to go at the time the rays are cast. The reason is obvious: a ray can hit anything in the whole scene, so every shader, every texture, and every heap has to be ready and filled with data before we cast a single ray.
So far so good.
My first test was adding all resources to a single heap, based on some DXR tutorials. The problem with this approach arises with objects that share the same shaders but use different textures. I defined one shader root signature for my single hit group, which I had to prepare before raytracing. But when creating a root signature, we have to say exactly which position in the heap corresponds to the SRV where the texture is located. Since there are many textures at different positions in the heap, I would need to create one root signature per object with different textures. This of course is not preferred, since based on the documentation and common sense, we should keep the number of root signatures as small as possible.
Therefore, I discarded this test.
My second approach was creating a descriptor heap per object, which contained all local descriptors for this particular object (textures, constants, etc.). The global resources (the TLAS (Top Level Acceleration Structure), the output, and the camera constant buffer) were kept in a separate heap. In this approach, I think I misunderstood the documentation by thinking I could add multiple heaps to a root signature. As I'm writing this post, I could not find a way of adding two separate heaps to a single root signature. If this is possible, I would love to know how, so any help is appreciated.
Here is the code I'm using for my root signature (using the DX12 helpers):
bool PipelineState::CreateHitSignature(Microsoft::WRL::ComPtr<ID3D12RootSignature>& signature)
{
    const auto device = RaytracingModule::GetInstance()->GetDevice();
    if (device == nullptr)
    {
        return false;
    }

    nv_helpers_dx12::RootSignatureGenerator rsc;
    rsc.AddRootParameter(D3D12_ROOT_PARAMETER_TYPE_SRV, 0); // "t0": vertices and colors

    // Add ranges pointing into the heap
    rsc.AddHeapRangesParameter({
        {2 /*t2*/, 1, 0, D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1}, /* 2nd slot of the first heap */
        {3 /*t3*/, 1, 0, D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 3}, /* 4th slot of the first heap. Per-instance data */
    });

    signature = rsc.Generate(device, true);
    return signature.Get() != nullptr;
}
My last approach would be to create a heap containing all necessary resources (TLAS, CBVs, SRVs/textures, etc.) per object, so effectively one heap per object. Again, as I was reading the documentation, this was not advised; the documentation stated that we should group resources into global heaps. At this point I have a feeling I'm mixing DX12 and DXR documentation and best practices, applying DX12 recommendations in the DXR domain, which is probably wrong.
I also read partly through the Nvidia Falcor source code, and they seem to have one resource heap per descriptor type, effectively limiting the number of descriptor heaps to a minimum (which makes total sense), but I have not yet found how a root signature is created with multiple separate heaps.
I feel like I'm missing one last puzzle piece to this mystery before it all falls into place and creates a beautiful image. So if anyone could explain how resource management (heaps, descriptors, etc.) should be handled in DXR if we want to have many objects with different resources, it would help me a lot.
So thanks in advance!
Jakub
With DXR you need to start at shader model 6.2, where dynamic indexing gained much more official support than the "secret" approach of SM 5.1, which amounted to "the last descriptor is free to leak into seemingly overrun indices".
Now you have full "bindless" with a declarative syntax of the form type var[] : register(t4, space1); and you can index freely: var[1] will access register t5 in space1, and so on.
You can set up register ranges in the descriptor table, so if you have 100 textures you can declare a range that spans all 100.
You can even declare other resources after the array variable, as long as you remember to skip past all the registers the array occupies. But it's easier to use different virtual register spaces:
cbuffer AmbianceCB : register(b0, space0) { float4 ambiance; }
Texture2D all_albedos[] : register(t0, space1);
cbuffer WorldCB : register(b1, space0) { float4x4 world; }
Now you can go up to t100 in space1 with no disturbance to the space0 declarations that follow.
The limit on the register value is lifted in SM6; it goes up to the maximum supported heap allocation.
So all_albedos[3400].Sample(..) is a perfectly acceptable call (provided your heap has bound the views).
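On the root-signature side, the counterpart of that unbounded all_albedos[] declaration is a descriptor range with an unbounded count in space1. A minimal sketch using the version-1.1 structures; the variable names are my own:
#include <d3d12.h>
#include <climits>

// Unbounded SRV range at t0, space1 (matches "Texture2D all_albedos[] : register(t0, space1);").
D3D12_DESCRIPTOR_RANGE1 albedoRange = {};
albedoRange.RangeType          = D3D12_DESCRIPTOR_RANGE_TYPE_SRV;
albedoRange.NumDescriptors     = UINT_MAX;  // unbounded: spans as many views as the heap holds
albedoRange.BaseShaderRegister = 0;         // t0
albedoRange.RegisterSpace      = 1;         // space1
albedoRange.Flags              = D3D12_DESCRIPTOR_RANGE_FLAG_DESCRIPTORS_VOLATILE;
albedoRange.OffsetInDescriptorsFromTableStart = 0;

// One descriptor-table root parameter that carries the range above.
D3D12_ROOT_PARAMETER1 texturesTable = {};
texturesTable.ParameterType                       = D3D12_ROOT_PARAMETER_TYPE_DESCRIPTOR_TABLE;
texturesTable.DescriptorTable.NumDescriptorRanges = 1;
texturesTable.DescriptorTable.pDescriptorRanges   = &albedoRange;
texturesTable.ShaderVisibility                    = D3D12_SHADER_VISIBILITY_ALL;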
Unfortunately, in DX12 they give you the feeling you can bind multiple heaps with the CommandList::SetDescriptorHeaps function, but if you try you'll get runtime errors:
D3D12 ERROR: ID3D12CommandList::SetDescriptorHeaps: pDescriptorHeaps[1] sets a descriptor heap type that appears earlier in the pDescriptorHeaps array.
Only one of any given descriptor heap type can be set at a time. [ EXECUTION ERROR #554: SET_DESCRIPTOR_HEAP_INVALID]
It's misleading, so don't trust that plural 's' in the method name.
Really, if we have multiple heaps, that would only be for a triple-buffered circular update/usage scheme, or for the upload versus shader-visible split, I suppose. Just put everything in your one heap, and let the descriptor tables index into it as needed.
A descriptor table is a very lightweight element; it's essentially just three ints: a descriptor start, a span, and a virtual register space. Just use that; the range can span 1000 textures if you have 1000 textures in your scene. You can get the material ID by embedding it in an indirection texture with unique UVs (like a lightmap), or in the vertex data, or simply from the hit group (if you set up one hit group per object). Your hit group index, which is given by a system value in the shader, then becomes your texture index.
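As a rough sketch of what that looks like at record time (hypothetical heap and root-signature variables; DXR global root arguments go through the compute bindings):
// Bind the one shader-visible CBV/SRV/UAV heap, then point the
// descriptor-table root parameter at it.
ID3D12DescriptorHeap* heaps[] = { shaderVisibleHeap };           // hypothetical heap
commandList->SetDescriptorHeaps(_countof(heaps), heaps);
commandList->SetComputeRootSignature(globalRootSignature);       // hypothetical root signature

// Root parameter 0: the unbounded texture table declared earlier;
// here the table simply starts at the beginning of the heap.
commandList->SetComputeRootDescriptorTable(
    0, shaderVisibleHeap->GetGPUDescriptorHandleForHeapStart());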
Dynamic indexing of HLSL 5.1 might be the solution to this issue.
https://learn.microsoft.com/en-us/windows/win32/direct3d12/dynamic-indexing-using-hlsl-5-1
With dynamic indexing, we can create one heap containing all materials and give each object an index that the shader uses to fetch the correct material at run time.
Therefore, we do not need multiple heaps of the same type, which is not possible anyway: only one heap per heap type can be bound at a time.
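A sketch of the heap-packing side of that idea, with hypothetical names: copy each material's texture SRV into one shader-visible heap so that the heap slot equals the material index the shader will use for dynamic indexing.
// Pack every material's texture SRV into one shader-visible heap so that
// heap slot == material index, and the shader can do allTextures[materialIndex].
const UINT increment =
    device->GetDescriptorHandleIncrementSize(D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);

D3D12_CPU_DESCRIPTOR_HANDLE dest = materialHeap->GetCPUDescriptorHandleForHeapStart();
for (UINT materialIndex = 0; materialIndex < static_cast<UINT>(materials.size()); ++materialIndex)
{
    // srvOnCpuHeap(materialIndex) is assumed to return the non-shader-visible
    // (staging) descriptor created for that material's texture.
    device->CopyDescriptorsSimple(1, dest, srvOnCpuHeap(materialIndex),
                                  D3D12_DESCRIPTOR_HEAP_TYPE_CBV_SRV_UAV);
    dest.ptr += increment;
}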
Given a compute shader where I have set the local size of each dimension to the values x, y, and z, is there any way for me to access that information from the C++ code? I.e.:
// Pseudocode (C++)
int size[3];
size = get local sizes from linked compute shader
print(size);
//GLSL Code
layout (local_size_x = a number, local_size_y = a number, local_size_z = a number) in;
Having run around looking, I found the following on Khronos.org, on its page concerning glGetProgramiv, found here:
https://www.khronos.org/registry/OpenGL-Refpages/es3/html/glGetProgramiv.xhtml
GL_COMPUTE_WORK_GROUP_SIZE
params returns an array of three integers containing the local work group size of the compute program as specified by its input layout qualifier(s). program must be the name of a program object that has been previously linked successfully and contains a binary for the compute shader stage.
That makes the line I needed:
glGetProgramiv(ComputeShaderID, GL_COMPUTE_WORK_GROUP_SIZE, localWorkGroupSize);
where localWorkGroupSize is an array of 3 integers.
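Put together, a minimal sketch, assuming a GL 4.3+ context and that ComputeShaderID is a successfully linked program object containing a compute stage:
#include <cstdio>

GLint localWorkGroupSize[3] = {0, 0, 0};
glGetProgramiv(ComputeShaderID, GL_COMPUTE_WORK_GROUP_SIZE, localWorkGroupSize);
std::printf("local_size = %d x %d x %d\n",
            localWorkGroupSize[0], localWorkGroupSize[1], localWorkGroupSize[2]);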
Or do I need to calculate this myself? I can't find a reference for built-in global variables in HLSL compute shaders.
This should be SV_GroupIndex, which, as described on MSDN, is:
The "flattened" index of a compute shader thread within a thread group, which turns the multi-dimensional SV_GroupThreadID into a 1D value. SV_GroupIndex varies from 0 to (numthreadsX * numthreadsY * numThreadsZ) – 1.
SV_GroupIndex = SV_GroupThreadID.z*dimx*dimy +
SV_GroupThreadID.y*dimx +
SV_GroupThreadID.x
MSDN Documentation Link
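To double-check the formula, here is the same arithmetic on the CPU, with hypothetical [numthreads(8, 8, 4)] dimensions and a sample thread ID:
// CPU-side sketch mirroring the SV_GroupIndex formula above.
const unsigned dimX = 8, dimY = 8, dimZ = 4;
const unsigned tid[3] = {3, 5, 2};   // hypothetical SV_GroupThreadID (x, y, z)

unsigned groupIndex = tid[2] * dimX * dimY   // 2 * 64 = 128
                    + tid[1] * dimX          // 5 * 8  = 40
                    + tid[0];                // 3      -> 171
// groupIndex ranges from 0 to dimX * dimY * dimZ - 1 (here 255).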
I'm rendering single-pixel points into a uint32 texture with a compute shader. The texture is a 3D texture: x and y are viewport coordinates, and z holds depth information at coordinate 0 and additional attributes at 1. So, two manually built render targets, if you will. The code looks like this:
layout (r32ui, binding = 0) coherent volatile uniform uimage3D renderBuffer;
layout (rgba32f, binding = 1) restrict readonly uniform imageBuffer pointBuffer;

for (int j = 0; j < numPoints / gl_WorkGroupSize.x + 1; j++)
{
    vec4 point = imageLoad(pointBuffer, ...);
    // ... transform point ...
    uint originalDepth = imageAtomicMin(renderBuffer, ivec3(imageCoords, 0), point.depth);
    if (originalDepth >= point.depth)
    {
        // write happened, store the attributes
        imageStore(renderBuffer, ivec3(imageCoords, 1), point.attributes);
    }
}
While the depth values are correct, I have a few pixels where the attributes flicker between two values.
The order of points in the pointBuffer is random (but I've verified that the set of all points is always the same), so my first thought was that two equal depth values might change the output depending on which one comes first. So I made it so that, if originalDepth == point.depth, it uses imageAtomicMax to always have the same one of the two alternative attributes written, but that changed nothing.
I scattered barrier() and memoryBarrier() all over the place, but that changed nothing. I also removed all diverging control flow for this; again, nothing changed.
Reducing the local work size to 32 removes 90% of the flickering, but some still remains.
Any ideas would be greatly appreciated.
Edit: before you ask why I do this manually instead of using normal rasterization and fragment shaders, the reason is performance. The rasterizer does not help since I'm rendering single-pixel points, shared memory sped things up greatly, and I render each point multiple times, which required me to use a geometry shader, which was slow.
The problem is this: you have a race condition on writing to renderBuffer. If two different CS invocations map to the same pixel, and both of them decide to write the value, then there is a race on your imageStore call. One may overwrite the other, it may be a partial overwrite, or something else entirely. But in any case, it's not guaranteed to work.
This would be best solved by doing what rasterizers do: break the process down into two separate phases. The first phase does the ... transform point ... part, writing that data out to a buffer. The second phase then goes through the points and writes them to the final image.
In phase 2, each CS invocation performs all of the processing for a particular output pixel. That way, there are no race conditions. Of course, that requires that phase 1 produces data in a way that can be ordered per-pixel.
There are several ways to go about the latter. You could use a linked list, with a list per pixel. Or you could use a list per workgroup, where a workgroup represents some X/Y region of pixel space. In that case, you would use local shared memory as your local depth buffer, with all CS invocations reading from/writing to that region. After they all get done processing pixels, you write it out to real memory. Basically, you'd be implementing tile-based rendering manually.
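On the host side, the two-phase structure would look roughly like this (hypothetical program and count names; the barriers are what make phase 1's writes visible to phase 2):
// Phase 1: transform the points and write them out to a buffer.
glUseProgram(transformPointsProgram);
glDispatchCompute(numPointGroups, 1, 1);

// Make phase 1's buffer writes visible to phase 2's reads.
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

// Phase 2: one workgroup per tile resolves its pixels and writes the image.
glUseProgram(resolvePixelsProgram);
glDispatchCompute(numTileGroupsX, numTileGroupsY, 1);

// Make phase 2's image writes visible before sampling/blitting the result.
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);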
Indeed, if you have a lot of these points, a tile-based solution would allow you to incorporate pipelining, so that you don't have to wait until all of phase 1 is done before starting on some of phase 2. You could break phase 1 down into chunks. You start a couple of phase 1 chunks, then a phase 2 chunk that reads from the first phase 1, then another phase 1, and so forth.
Vulkan, with its event system, has better tools for building such an efficient dependency chain than OpenGL.
I'm trying to find ways to copy multidimensional arrays from host to device in OpenCL, and I thought an approach would be to use an image, which can be a 1-, 2-, or 3-dimensional object. However, I'm confused, because when reading a pixel from an image, the functions use vector datatypes. Normally I would think of a double pointer, but it doesn't sound like that is what is meant by vector datatypes. Anyway, here are my questions:
1) What is actually meant by a vector datatype, and why wouldn't we just specify 2 or 3 indices when denoting pixel coordinates? It looks like a single value such as float2 is being used to denote coordinates, but that makes no sense to me. I'm looking at the functions read_imageui and read_image.
2) Can the input image just be a subset of the entire image, and can the sampler be a subset of the input image? I don't understand how the coordinates are actually specified here either, since read_image() only seems to take a single value for the input and a single value for the sampler.
3) If doing linear algebra, should I just bite the bullet and translate 1-D array data from the buffer into multi-dim arrays in opencl?
4) I'm still interested in images, so even if what I want to do is not best for images, could you still explain questions 1 and 2?
Thanks!
EDIT
I wanted to refine my question and ask: in the following Khronos documentation they define...
int4 read_imagei(image2d_t image,
                 sampler_t sampler,
                 int2 coord)
But nowhere can I find what image2d_t's definition or structure is supposed to be. The same thing goes for sampler_t and int2 coord. They seem like structs to me, or pointers to structs, since OpenCL is supposed to be based on ANSI C, but what are the fields of these structs, and how do I write the coord with what looks like a single scalar?! I've seen the notation (int2)(x, y), but that's not ANSI C; that looks like Scala, haha. Things seem conflicting to me. Thanks again!
In general you can read from images in three different ways:
direct pixel access, no sampling
sampling, normalized coordinates
sampling, integer coordinates
The first one is what you want, that is, you pass integer pixel coordinates like (10, 43) and it will return the contents of the image at that point, with no filtering whatsoever, as if it were a memory buffer. You can use the read_image*() family of functions which take no sampler_t param.
The second one is what most people want from images, you specify normalized image coords between 0 and 1, and the return value is the interpolated image color at the specified point (so if your coordinates specify a point in between pixels, the color is interpolated based on surrounding pixel colors). The interpolation, and the way out-of-bounds coordinates are handled, are defined by the configuration of the sampler_t parameter you pass to the function.
The third one is the same as the second one, except the texture coordinates are not normalized, and the sampler needs to be configured accordingly. In some sense the third way is closer to the first, and the only additional feature it provides is the ability to handle out-of-bounds pixel coordinates (for instance, by wrapping or clamping them) instead of you doing it manually.
Finally, the different versions of each function, e.g. read_imagef, read_imagei, read_imageui are to be used depending on the pixel format of your image. If it contains floats (in each channel), use read_imagef, if it contains signed integers (in each channel), use read_imagei, etc...
Writing to an image on the other hand is straightforward, there are write_image{f,i,ui}() functions that take an image object, integer pixel coordinates and a pixel color, all very easy.
Note that you cannot read and write to the same image in the same kernel! (I don't know if recent OpenCL versions have changed that). In general I would recommend using a buffer if you are not going to be using images as actual images (i.e. input textures that you sample or output textures that you write to only once at the end of your kernel).
About the image2d_t, sampler_t types, they are OpenCL "pseudo-objects" that you can pass into a kernel from C (they are reserved types). You send your image or your sampler from the C side into clSetKernelArg, and the kernel gets back a sampler_t or an image2d_t in the kernel's parameter list (just like you pass in a buffer object and it gets a pointer). The objects themselves cannot be meaningfully manipulated inside the kernel, they are just handles that you can send into the read_image/write_image functions, along with a few others.
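For completeness, here's a hedged host-side sketch (hypothetical context, kernel, and hostPixels variables) of creating an image and a sampler and handing both to a kernel; this uses the C API and the pre-2.0 clCreateSampler entry point:
#include <CL/cl.h>

// Create a 512x512 RGBA float image initialized from host memory.
cl_int err = CL_SUCCESS;

cl_image_format fmt = {};
fmt.image_channel_order     = CL_RGBA;
fmt.image_channel_data_type = CL_FLOAT;

cl_image_desc desc = {};
desc.image_type   = CL_MEM_OBJECT_IMAGE2D;
desc.image_width  = 512;
desc.image_height = 512;

cl_mem img = clCreateImage(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                           &fmt, &desc, hostPixels /* 512*512*4 floats */, &err);

// Nearest-neighbour sampling, unnormalized (integer) coordinates, clamped at the edges.
cl_sampler smp = clCreateSampler(context, CL_FALSE /* non-normalized coords */,
                                 CL_ADDRESS_CLAMP_TO_EDGE, CL_FILTER_NEAREST, &err);

// The kernel side receives these as image2d_t and sampler_t parameters.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &img);
clSetKernelArg(kernel, 1, sizeof(cl_sampler), &smp);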
As for the "actual" low-level difference between images and buffers, GPU's often have specially reserved texture memory that is highly optimized for "read often, write once" access patterns, with special texture sampling hardware and texture caches to optimize scatter reads, mipmaps, etc..
On the CPU there is probably no underlying difference between an image and a buffer, and your runtime likely implements both as memory arrays while enforcing image semantics.