DXR Descriptor Heap management for raytracing - c++

After watching videos and reading the documentation on DXR and DX12, I'm still not sure how to manage resources for DX12 raytracing (DXR).
Resource management differs quite a bit between rasterizing and raytracing. The main difference is that rasterizing uses a lot of temporary resources that can be bound on the fly, whereas raytracing needs all resources ready to go at the time rays are cast. The reason is obvious: a ray can hit anything in the whole scene, so we need every shader, every texture, and every heap ready and filled with data before we cast a single ray.
So far so good.
My first test was adding all resources to a single heap, based on some DXR tutorials. The problem with this approach arises with objects that share the same shaders but use different textures. I defined 1 shader root signature for my single hit group, which I had to prepare before raytracing. But when creating a root signature, we have to state exactly which position in the heap corresponds to the SRV where the texture is located. Since there are many textures at different positions in the heap, I would need to create 1 root signature per object with different textures. This of course is not preferred, since based on documentation and common sense, we should keep the number of root signatures as small as possible.
Therefore, I discarded this test.
My second approach was creating a descriptor heap per object, which contained all local descriptors for this particular object (textures, constants, etc.). The global resources (the TLAS, or Top Level Acceleration Structure, plus the output and camera constant buffer) were kept in a separate heap. In this approach, I think I misunderstood the documentation by thinking I can add multiple heaps to a root signature. As I'm writing this post, I could not find a way of adding 2 separate heaps to a single root signature. If this is possible, I would love to know how, so any help is appreciated.
Here is the code I'm using for my root signature (using the DX12 helpers):
bool PipelineState::CreateHitSignature(Microsoft::WRL::ComPtr<ID3D12RootSignature>& signature)
{
    const auto device = RaytracingModule::GetInstance()->GetDevice();
    if (device == nullptr)
    {
        return false;
    }
    nv_helpers_dx12::RootSignatureGenerator rsc;
    rsc.AddRootParameter(D3D12_ROOT_PARAMETER_TYPE_SRV, 0); // "t0" vertices and colors
    // Add a single range pointing to the TLAS in the heap
    rsc.AddHeapRangesParameter({
        {2 /*t2*/, 1, 0, D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1}, /* 2nd slot of the first heap */
        {3 /*t3*/, 1, 0, D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 3}, /* 4th slot of the first heap. Per-instance data */
    });
    signature = rsc.Generate(device, true);
    return signature.Get() != nullptr;
}
Now my last approach would be to create a heap containing all necessary resources (TLAS, CBVs, SRVs for textures, etc.) per object, so effectively 1 heap per object. Again, as I was reading the documentation, this was not advised; the documentation states that we should group resources into global heaps. At this point, I have a feeling I'm mixing DX12 and DXR documentation and best practices by using proposals from DX12 in the DXR domain, which is probably wrong.
I also read partly through the Nvidia Falcor source code, and they seem to have 1 resource heap per descriptor type, effectively limiting the number of descriptor heaps to a minimum (makes total sense), but I have not yet found how a root signature is created with multiple separate heaps.
I feel like I'm missing one last puzzle piece before it all falls into place and creates a beautiful image. So if anyone could explain how resource management (heaps, descriptors, etc.) should be handled in DXR when we want many objects with different resources, it would help me a lot.
So thanks in advance!
Jakub

With DXR you need shader model 6.3 (raytracing shaders are compiled as lib_6_3 libraries), and by that point dynamic indexing has much more official support than the "secret" approach of 5.1, where "the last descriptor is free to leak into seemingly overrunning indices".
Now you have full "bindless" using a type var[] : register(t4, space1); declarative syntax, and you can index freely: var[1] will access register(t5, space1), etc.
You can set up register ranges in the descriptor table, so if you have 100 textures you can span 100.
You can even declare other resources after the array variable, as long as you remember to skip all the registers the array occupies. But it's easier to use different virtual spaces:
cbuffer Ambiance : register(b0, space0) { float4 ambiance; };
Texture2D all_albedos[] : register(t0, space1);
cbuffer World : register(b1, space0) { float4x4 world; };
Now you can go to t100 with no disturbance on the following space0 declarations.
The limit on the register value is lifted in SM6. It goes up to the maximum supported heap allocation.
So all_albedos[3400].Sample(..) is a perfectly acceptable call (provided your heap has bound the views).
Unfortunately, DX12 gives you the feeling that you can bind multiple heaps with the ID3D12GraphicsCommandList::SetDescriptorHeaps function, but if you try you'll get runtime errors:
D3D12 ERROR: ID3D12CommandList::SetDescriptorHeaps: pDescriptorHeaps[1] sets a descriptor heap type that appears earlier in the pDescriptorHeaps array.
Only one of any given descriptor heap type can be set at a time. [ EXECUTION ERROR #554: SET_DESCRIPTOR_HEAP_INVALID]
It's misleading, so don't trust that plural s in the method name. The plural only covers binding one CBV/SRV/UAV heap together with one sampler heap.
Really, if you have multiple heaps, it should only be for a triple-buffered circular update/usage scheme, or for an upload/shader-visible split, I suppose. Just put everything in your one heap, and let the descriptor table index into it as needed.
A descriptor table is a very lightweight element; it's just 3 ints: a descriptor start, a span, and a virtual space. Just use that; you can span 1000 textures if you have 1000 textures in your scene. You can get the material ID by embedding it in an indirection texture with unique UVs (like a lightmap), in the vertex data, or per hit group (if you set up 1 hit group = 1 object). Your hit group index, which is given by a system value in the shader, will then be your texture index.
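The "one heap, index into it" layout can be sketched on the CPU side without any D3D12 types. This is only a minimal sketch under stated assumptions: DescriptorSlotAllocator, the reserved global slots, and the literal start address and increment are hypothetical. In a real renderer the start address would come from the heap's GetGPUDescriptorHandleForHeapStart and the increment from ID3D12Device::GetDescriptorHandleIncrementSize.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical bookkeeping for a single shader-visible CBV/SRV/UAV heap.
// Slots [0, reservedGlobals) hold global descriptors (TLAS, output, camera);
// each later slot holds one texture SRV, read in HLSL as all_albedos[index].
class DescriptorSlotAllocator {
public:
    DescriptorSlotAllocator(std::uint64_t heapStart, std::uint32_t incrementSize,
                            std::uint32_t capacity, std::uint32_t reservedGlobals)
        : heapStart_(heapStart), incrementSize_(incrementSize), capacity_(capacity),
          firstTextureSlot_(reservedGlobals), nextSlot_(reservedGlobals) {
        assert(incrementSize_ > 0 && reservedGlobals <= capacity_);
    }

    // Returns the index into the unbounded texture array.
    // Registering the same texture twice returns the existing index.
    std::uint32_t RegisterTexture(const std::string& name) {
        auto it = textureIndex_.find(name);
        if (it != textureIndex_.end()) return it->second;
        if (nextSlot_ >= capacity_) throw std::runtime_error("descriptor heap is full");
        const std::uint32_t index = nextSlot_ - firstTextureSlot_;
        ++nextSlot_;
        textureIndex_.emplace(name, index);
        return index;
    }

    // Address of the descriptor behind a texture index: start + slot * increment.
    std::uint64_t HandleForTexture(std::uint32_t textureIndex) const {
        return heapStart_ +
               std::uint64_t(firstTextureSlot_ + textureIndex) * incrementSize_;
    }

private:
    std::uint64_t heapStart_;
    std::uint32_t incrementSize_;
    std::uint32_t capacity_;
    std::uint32_t firstTextureSlot_;
    std::uint32_t nextSlot_;
    std::unordered_map<std::string, std::uint32_t> textureIndex_;
};
```

The index returned by RegisterTexture is what each object (or hit group) would carry, so a single root signature with one descriptor table covering the texture range can serve every object.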

Dynamic indexing of HLSL 5.1 might be the solution to this issue.
https://learn.microsoft.com/en-us/windows/win32/direct3d12/dynamic-indexing-using-hlsl-5-1
With dynamic indexing, we can create one heap containing all materials and use a per-object index in the shader to pick the correct material at run time.
Therefore, we do not need multiple heaps of the same type, which is not possible anyway: only 1 heap per heap type can be bound at the same time.

Related

Memory Management of update Method in Texture Pixel Manipulation

How is the array of pixels, that is passed to the update method in the Texture class (SFML), managed memory-wise? These are some of my guesses:
A weak pointer is saved inside the texture instance; which means that it is necessary to keep a pointer to the array of pixels of your own and manage it yourself.
The array is copied and managed by the texture (which also means that every time the update method is called again, the previous one is deallocated).
The second guess would justify this for updating a texture multiple times:
auto newPixels = new sf::Uint8[WIDTH * HEIGHT * 4];
... //do stuff to pixels
texture.update(newPixels);
Where the pixels are reallocated every time the texture is updated. Otherwise (if the pixels are just stored as a weak pointer and not managed/deallocated/allocated) a different approach would be necessary, where the pixels are managed by the user...
Thanks in advance for any answers :)
SFML is open source. You don't need to take guesses or ask here. You can just read it for yourself:
https://github.com/SFML/SFML/blob/master/src/SFML/Graphics/Texture.cpp#L390
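Specifically, the pointer is passed to the glTexSubImage2D OpenGL function, which copies the pixel data into the texture. SFML does not keep the pointer, so the array remains yours to manage.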

Draw multiple meshes to different locations (DirectX 12)

I have a problem with DirectX 12. I have made a small 3D renderer. Models are transformed into 3D space in the vertex shader using basic World View Projection matrices stored in a constant buffer.
To change the data of the constant buffer I'm currently using memcpy(pMappedConstantBuffer + alignedSize * frame, newConstantBufferData, alignedSize). This command replaces the constant buffer's data immediately.
So the problem comes here, drawing is recorded to a command list that will be later sent to the gpu for execution.
Example:
/* Now i want to change the constant buffer to change the next draw call's position to (0, 1, 0) */
memcpy(/*Parameters*/);
/* Now i want to record a draw call to the command list */
DrawInstanced(/*Parameters*/);
/* But now i want to draw other mesh to other position so i have to change the constant buffer. After this memcpy() the draw position will be (0, -1, 0) */
memcpy(/*Parameters*/);
/* Now i want to record new draw call to the list */
DrawInstanced(/*Parameters*/);
After this I send the command list to the GPU for execution, but guess what: all the meshes end up in the same position, because all the memcpys are executed before the command list is even sent to the GPU. So basically the last memcpy overwrites the previous ones.
So the question is: how do I draw meshes at different positions, or how do I replace the constant buffer's data in the command list so that the constant buffer changes between draw calls on the GPU?
Thanks
No need for help anymore, I solved it by myself. I created a constant buffer for each mesh.
About execution order, you are totally right: your memcpy calls will update the buffers immediately, but the commands will not be processed until you push your command list to the queue (and you will not know exactly when this will happen).
In Direct3D11, when you use Map on a buffer, this is handled for you (some space will be allocated to avoid the problem if required).
So in Direct3D12 you have several choices. I'll assume that you want to draw N objects and store one matrix per object in your cbuffer.
The first is to create one buffer per object and set the data independently. If you have only a few objects, this is easy to maintain (and the extra memory footprint due to resource allocations will be OK).
Another option is to create one large buffer (which can contain N matrices) and create N constant buffer views that point to the memory location of each object. (Please note that you have to respect the 256-byte alignment in that case too, see CreateConstantBufferView.)
You can also use a StructuredBuffer and copy all the data into it (in that case you do not need the alignment), then use an index in the vertex shader to look up the correct matrix. (It is possible to declare a uint value in your shader and use SetGraphicsRoot32BitConstant to set it directly.)
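The "one large buffer with N aligned slices" option can be sketched like this. It is a minimal sketch, assuming a plain std::vector as a stand-in for the mapped upload buffer; PerObjectConstantBuffer and ObjectConstants are hypothetical names, and the 256-byte value mirrors D3D12_CONSTANT_BUFFER_DATA_PLACEMENT_ALIGNMENT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Constant buffer views must start on 256-byte boundaries in D3D12.
constexpr std::size_t kCbvAlignment = 256;

// Rounds size up to the next multiple of alignment (alignment is a power of two).
constexpr std::size_t AlignUp(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

struct ObjectConstants {   // hypothetical per-object data
    float world[16];
};

// Stand-in for a mapped upload buffer that holds one aligned slice per object.
class PerObjectConstantBuffer {
public:
    explicit PerObjectConstantBuffer(std::size_t objectCount)
        : stride_(AlignUp(sizeof(ObjectConstants), kCbvAlignment)),
          storage_(stride_ * objectCount) {}

    // Byte offset of an object's slice; a CBV for that object would point here.
    std::size_t OffsetOf(std::size_t objectIndex) const { return objectIndex * stride_; }

    // Each object writes to its own slice, so a later write never clobbers
    // data that an already recorded draw call still has to read.
    void Write(std::size_t objectIndex, const ObjectConstants& data) {
        assert(OffsetOf(objectIndex) + sizeof(data) <= storage_.size());
        std::memcpy(storage_.data() + OffsetOf(objectIndex), &data, sizeof(data));
    }

    ObjectConstants Read(std::size_t objectIndex) const {
        ObjectConstants out;
        std::memcpy(&out, storage_.data() + OffsetOf(objectIndex), sizeof(out));
        return out;
    }

private:
    std::size_t stride_;                 // declared first so it is initialized first
    std::vector<std::uint8_t> storage_;
};
```

With several frames in flight, each frame would need its own set of slices as well, so that the CPU does not overwrite data the GPU is still reading from a previous frame.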

Manual depth rendering: Random results despite using atomic operations

i'm rendering single-pixel points into a uint32-texture with a compute shader. the texture is a 3d texture, x and y are viewport coordinates, z has depth information on coordinate 0 and additional attributes on 1. so two manually built rendertargets, if you will. code looks like this:
layout (r32ui, binding = 0) coherent volatile uniform uimage3D renderBuffer;
layout (rgba32f, binding = 1) restrict readonly uniform imageBuffer pointBuffer;
for(int j = 0; j < numPoints / gl_WorkGroupSize.x + 1; j++)
{
    vec4 point = imageLoad(pointBuffer, ...);
    // ... transform point ...
    uint originalDepth = imageAtomicMin(renderBuffer, ivec3(imageCoords, 0), point.depth);
    if (originalDepth >= point.depth)
    {
        // write happened, store the attributes
        imageStore(renderBuffer, ivec3(imageCoords, 1), point.attributes);
    }
}
while the depth values are correct, i have a few pixels where the attributes flicker between two values.
the order of points in the pointBuffer is random (but i've verified the set of all points is always the same), so my first thought was that two equal depth values might change the output, depending on which one comes first. so i made it that, if originalDepth == point.depth it uses imageAtomicMax to always have the same of the two alternative attributes written, but that changed nothing.
i scattered barrier() and memoryBarrier() all over the place, but that changed nothing. i also removed all diverging control flow for this, changed nothing.
reducing the local work size to 32 removes 90% of the flickering, but some still remains.
any ideas would be greatly appreciated.
edit: before you ask why i do this stuff manually instead of using normal rasterization and fragment shaders, the reason is performance. the rasterizer does not help since i'm rendering single-pixel points, shared memory sped things up greatly, and i render each point multiple times, which would require a geometry shader, and that was slow.
The problem is this: you have a race condition on writing to renderBuffer. If two different CS invocations map to the same pixel, and both of them decide to write the value, then there is a race on your imageStore call. One may overwrite the other, it may be a partial overwrite, or something else entirely. But in any case, it's not guaranteed to work.
This would be best solved by doing what rasterizers do: break the process down into two separate phases. The first phase does the ... transform point ... part, writing that data out to a buffer. The second phase then goes through the points and writes them to the final image.
In phase 2, each CS invocation performs all of the processing for a particular output pixel. That way, there are no race conditions. Of course, that requires that phase 1 produces data in a way that can be ordered per-pixel.
There are several ways to go about the latter. You could use a linked list, with a list per pixel. Or you could use a list per workgroup, where a workgroup represents some X/Y region of pixel space. In that case, you would use local shared memory as your local depth buffer, with all CS invocations reading from and writing to that region. After they are all done processing pixels, you write it out to real memory. Basically, you'd be implementing tile-based rendering manually.
Indeed, if you have a lot of these points, a tile-based solution would allow you to incorporate pipelining, so that you don't have to wait until all of phase 1 is done before starting on some of phase 2. You could break phase 1 down into chunks. You start a couple of phase 1 chunks, then a phase 2 chunk that reads from the first phase 1, then another phase 1, and so forth.
Vulkan, with its event system, has better tools than OpenGL for building such an efficient dependency chain.
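The two-phase idea can be sketched sequentially on the CPU. This is only an illustration, not the GPU implementation: Fragment, Pixel, and Resolve are hypothetical names, the per-pixel bins stand in for the per-pixel lists, and the tie-break on equal depth (keep the larger attribute value) is an assumption chosen just to make the result independent of input order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Fragment {               // phase 1 output: one transformed point
    std::uint32_t pixel;        // linear pixel index (y * width + x)
    std::uint32_t depth;
    std::uint32_t attributes;
};

struct Pixel {
    std::uint32_t depth = std::numeric_limits<std::uint32_t>::max();
    std::uint32_t attributes = 0;
};

// Phase 2: every output pixel is resolved by exactly one loop iteration
// (one invocation in a compute shader version), so depth and attributes
// are always written together and no two writers touch the same pixel.
std::vector<Pixel> Resolve(const std::vector<Fragment>& fragments,
                           std::size_t pixelCount) {
    // Order the phase 1 data per pixel.
    std::vector<std::vector<Fragment>> bins(pixelCount);
    for (const Fragment& f : fragments) {
        assert(f.pixel < pixelCount);
        bins[f.pixel].push_back(f);
    }

    std::vector<Pixel> image(pixelCount);
    for (std::size_t p = 0; p < pixelCount; ++p) {
        Pixel& out = image[p];
        bool hit = false;
        for (const Fragment& f : bins[p]) {
            const bool closer = f.depth < out.depth;
            const bool tieWins = f.depth == out.depth && f.attributes > out.attributes;
            if (!hit || closer || tieWins) {
                out.depth = f.depth;
                out.attributes = f.attributes;
                hit = true;
            }
        }
    }
    return image;
}
```

Because the winner for each pixel is picked in one place, feeding the same set of points in a different order yields the same image, which is exactly what the separate imageAtomicMin and imageStore calls cannot guarantee.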

C++ Maya - Getting mesh vertices from frame and subframe

I'm writing a mesh deformer plugin that gets info about the mesh from past frames to perform some calculations. In the past, to get past mesh info, I did the following
MStatus MyClass::deform(MDataBlock& dataBlock, MItGeometry& itGeo,
                        const MMatrix& localToWorldMatrix, unsigned int index)
{
    MFnPointArrayData fnPoints;
    //... other init code
    MPlug meshPlug = nodeFn.findPlug(MString("inputMesh"));
    // gets the mesh connection from the previous frame
    MPlug meshPositionPlug = meshPlug.elementByLogicalIndex(0);
    MObject objOldMesh;
    meshPositionPlug.getValue(objOldMesh);
    fnPoints.setObject(objOldMesh);
    // previous frame's vertices
    MPointArray oldMeshPositionVertices = fnPoints.array();
    // ... calculations
    return MS::kSuccess;
}
If I needed more than one frame I'd run for-loops over logical indices and repeat the process. Since creating this, however, I've found that my plugin needs not just past frames but also future frames as well as subframes (between integer frames). Since my current code relies on elementByLogicalIndex() to get past frame info, which only takes unsigned integers, and the 0th index refers to the previous frame, I can't get subframe information. I haven't tried getting future frame info yet, but I don't think that's possible either.
How do I query mesh vertex positions in an array for past/future/sub-frames? Is my current method inflexible and, if so, how else could I do this?
So, the "intended" way to accomplish this is with an MDGContext, either with an MDGContextGuard, or with the versions of MPlug.asMObject that explicitly take a context (though these are deprecated).
Having said that - in the past when I've tried to use MDGContexts to query values at other times, I've found them either VERY slow, unstable, or both. So use with caution. It's possible that things will work better if, as you say, you're dealing purely with objects coming straight from an alembic mesh. However, if that's the case, you may have better luck reading the cache path from the node, and querying through the alembic API directly yourself.

Does an immutable texture need a GL_TEXTURE_MAX_LEVEL?

When allocating textures using glTexImage* functions, I know that I need to set glTexParameteri(GL_TEXTURE_MAX_LEVEL) to a reasonable value and specify all the levels up to that value, as described here.
I didn't expect this to be necessary for the glTexStorage* functions too, since they accept the number of levels as a parameter and allocate memory for that number of levels up front. Still, I noticed I couldn't sample an immutable texture defined this way until I called glGenerateMipmap or set GL_TEXTURE_MAX_LEVEL to levels-1.
I didn't find any official reason why it should be necessary and I expected immutable texture's parameters to be, well, immutable (and well-initialized). Can somebody confirm if (and why) this behaviour is correct? Or is it an AMD driver bug perhaps?
OK, I think I got that:
The parameter levels of glTexStorage is indeed stored in the texture object, but as GL_TEXTURE_IMMUTABLE_LEVELS, not as GL_TEXTURE_MAX_LEVEL, as I thought.
The parameter GL_TEXTURE_MAX_LEVEL hence remains at the default large value. (It's possible to change it manually: the immutable flag of texture object only relates to the texture buffer and its format, but not buffer data or parameters).
The texture immutability should affect LOD calculation in the following way according to the spec:
if TEXTURE_IMMUTABLE_FORMAT is TRUE, then levelbase is clamped to the range [0; levelimmut - 1]
So leaving GL_TEXTURE_MAX_LEVEL intact (= 1000) for an immutable texture shall have the same effect as setting it to levels-1.
Verdict: driver bug; the driver apparently omits this clamping step.
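The clamping step can be written out as a small function. This is a sketch of my reading of the rule, not driver code: EffectiveLevels and LevelRange are hypothetical names, and the second clamp (the max level limited to levelimmut - 1) is assumed from the conclusion above that leaving GL_TEXTURE_MAX_LEVEL at 1000 should behave like levels-1.

```cpp
#include <algorithm>
#include <cassert>

struct LevelRange {
    int base;
    int max;
};

// Effective mipmap range used for sampling. For immutable textures the
// base level is clamped to [0, immutableLevels - 1], and (assumed here)
// the max level is then clamped to [base, immutableLevels - 1].
LevelRange EffectiveLevels(int baseLevel, int maxLevel,
                           bool immutableFormat, int immutableLevels) {
    if (!immutableFormat) {
        return {baseLevel, maxLevel};   // mutable textures keep the raw parameters
    }
    assert(immutableLevels >= 1);
    const int last = immutableLevels - 1;
    const int base = std::min(std::max(baseLevel, 0), last);
    const int top = std::min(std::max(maxLevel, base), last);
    return {base, top};
}
```

For a texture allocated with 4 levels, EffectiveLevels(0, 1000, true, 4) and EffectiveLevels(0, 3, true, 4) give the same range, which is why a driver that requires GL_TEXTURE_MAX_LEVEL to be set by hand looks like it is skipping this step.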
"I know that I need to set glTexParameteri(GL_TEXTURE_MAX_LEVEL) to a reasonable value and specify all the levels up to that value, as described here."
Well, you don't have to. The default value for GL_TEXTURE_MAX_LEVEL is 1000 and hence larger than any image pyramid you'll ever reasonably use.
"Still, I noticed I couldn't sample an immutable texture defined this way - until I called glGenerateMipmap or specified GL_TEXTURE_MAX_LEVEL to levels-1."
Yes, that's because image storage is independent of image sampling. The value of GL_TEXTURE_MAX_LEVEL is a parameter that affects image access at sampling time (you could set it in a Sampler Object as well), and it is independent of the actual texture image storage. You can also change the range of image pyramid levels in use after image specification, if you want to select only a subrange of images for rendering, or upload images into only a subset of the allocated image pyramid.