Understanding the "sampler array index must be a literal expression" error in ComputeShaders - hlsl

Let's say I have a compute shader that retrieves data from a Texture2DArray using the Id of the group like this:
Texture2DArray<float4> gTextureArray[2];

[numthreads(32, 1, 1)]
void Kernel(uint3 GroupID : SV_GroupID, uint3 GroupThreadID : SV_GroupThreadID)
{
    float3 tmp = gTextureArray[GroupID.x].Load(int4(GroupThreadID.x, GroupThreadID.x, 0, 0)).rgb;
    ....
}
And let's say I launch it like this: deviceContext->Dispatch(2, 1, 1);
So, 2 groups of 32 threads each that read pixel values from a Texture2DArray. All the threads with GroupID.x = 0 will read values from gTextureArray[0] and all the threads with GroupID.x = 1 will read values from gTextureArray[1]. It turns out I can't compile that simple code; instead I get this compile error: error X3512: sampler array index must be a literal expression
Now, I know I can do this instead:
Texture2DArray<float4> gTextureArray[2];

[numthreads(32, 1, 1)]
void Kernel(uint3 GroupID : SV_GroupID, uint3 GroupThreadID : SV_GroupThreadID)
{
    float3 tmp = float3(0, 0, 0);
    if (GroupID.x == 0)
        tmp = gTextureArray[0].Load(int4(GroupThreadID.x, GroupThreadID.x, 0, 0)).rgb;
    else if (GroupID.x == 1)
        tmp = gTextureArray[1].Load(int4(GroupThreadID.x, GroupThreadID.x, 0, 0)).rgb;
    ....
}
Or use a switch when I have lots of groups, so it doesn't look quite as awful (it still does).
Notice that there is no warp divergence, since all threads in each group take the same branch. My question is: am I missing something here? Why does HLSL not support that kind of indexing, when I cannot see any divergence or other problems, at least in this case?

I'm not sure if you bind your pipeline correctly, but let's evaluate both cases.
When you have:
Texture2DArray<float4> gTextureArray[2];
You technically bind 2 texture arrays (one in slot 0, one in slot 1), so the runtime is not able to switch shader resource slots dynamically.
The line above is the same as doing:
Texture2DArray<float4> gTextureArray0;
Texture2DArray<float4> gTextureArray1;
You effectively bind 2 different resources in both cases, which is why you can't switch between them dynamically.
If instead you have a single Texture2DArray with 2 slices, this becomes possible; you need to change your code to:
Texture2DArray<float4> gTextureArray;

[numthreads(32, 1, 1)]
void Kernel(uint3 GroupID : SV_GroupID, uint3 GroupThreadID : SV_GroupThreadID)
{
    float3 tmp = gTextureArray.Load(int4(GroupThreadID.x, GroupThreadID.x, GroupID.x, 0)).rgb;
}
The Z component of the Load location is the slice index, so in that case it is perfectly possible.
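For reference, a minimal sketch of what the C++ side might look like for that single texture-array resource (a sketch only, assuming a D3D11 device and immediate context named device and deviceContext, and arbitrary 32x32 float4 textures):

D3D11_TEXTURE2D_DESC texDesc = {};
texDesc.Width = 32;                                   // assumed dimensions
texDesc.Height = 32;
texDesc.MipLevels = 1;
texDesc.ArraySize = 2;                                // 2 slices in one Texture2DArray
texDesc.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;
texDesc.SampleDesc.Count = 1;
texDesc.Usage = D3D11_USAGE_DEFAULT;
texDesc.BindFlags = D3D11_BIND_SHADER_RESOURCE;

ID3D11Texture2D* texture = nullptr;
device->CreateTexture2D(&texDesc, nullptr, &texture);

D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format = texDesc.Format;
srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2DARRAY;
srvDesc.Texture2DArray.MostDetailedMip = 0;
srvDesc.Texture2DArray.MipLevels = 1;
srvDesc.Texture2DArray.FirstArraySlice = 0;
srvDesc.Texture2DArray.ArraySize = 2;

ID3D11ShaderResourceView* srv = nullptr;
device->CreateShaderResourceView(texture, &srvDesc, &srv);

deviceContext->CSSetShaderResources(0, 1, &srv);      // a single SRV in slot t0
deviceContext->Dispatch(2, 1, 1);                     // 2 groups of 32 threads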


How to use collate_fn in LibTorch

I'm trying to implement an image-based regression using a CNN in libtorch. The problem is that my images have different sizes, which causes an exception when batching the images.
First things first, I create my dataset:
auto set = MyDataSet(pathToData).map(torch::data::transforms::Stack<>());
Then I create the dataLoader:
auto dataLoader = torch::data::make_data_loader(
    std::move(set),
    torch::data::DataLoaderOptions().batch_size(batchSize).workers(numWorkersDataLoader)
);
The exception is thrown when batching data in the train loop:
for (torch::data::Example<> &batch : *dataLoader) {
    processBatch(model, optimizer, counter, batch);
}
with a batch size greater than 1 (with a batch size of 1 everything works fine because no stacking is involved). For example, I get the following error using a batch size of 2:
...
what(): stack expects each tensor to be equal size, but got [3, 1264, 532] at entry 0 and [3, 299, 294] at entry 1
I read that one could, for example, use collate_fn to implement some padding (for example here); I just do not get where to implement it. torch::data::DataLoaderOptions, for example, does not offer such an option.
Does anyone know how to do this?
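To illustrate the padding idea mentioned in the question (this is just a hypothetical sketch, not the approach taken in the answer below): one place it could go is the dataset's get(), so that every tensor handed to Stack<> already has the same size. loadImage, targets, maxWidth and maxHeight are assumed names.

// Hypothetical sketch only: pad every image to a common size inside get().
torch::data::Example<> MyDataSet::get(size_t index) {
    torch::Tensor image = loadImage(index);               // assumed helper, returns a [3, H, W] tensor
    const int64_t padRight  = maxWidth  - image.size(2);  // maxWidth/maxHeight: assumed dataset-wide maxima
    const int64_t padBottom = maxHeight - image.size(1);
    // constant_pad_nd pads the last dimensions first: {left, right, top, bottom}
    image = torch::constant_pad_nd(image, {0, padRight, 0, padBottom}, 0);
    return {image, targets[index]};                       // targets: assumed per-sample target tensors
}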
I've got a solution now. In summary, I split my CNN into convolutional and dense layers and use the output of a torch::nn::AdaptiveMaxPool2d during batch construction.
In order to do so, I had to modify my Dataset, Net, and train/val/test methods. In my Net I added two additional forward functions. The first one passes data through all convolutional layers and returns the output of an AdaptiveMaxPool2d layer. The second one passes the data through all dense layers. In practice this looks like:
torch::Tensor forwardConLayer(torch::Tensor x) {
    // Convolutional part: variable-sized input, fixed-size output thanks to adaptive pooling
    x = torch::relu(conv1(x));
    x = torch::relu(conv2(x));
    x = torch::relu(conv3(x));
    x = torch::relu(ada1(x));   // ada1 is the AdaptiveMaxPool2d layer
    x = torch::flatten(x);
    return x;
}

torch::Tensor forwardDenseLayer(torch::Tensor x) {
    // Dense part: operates on the fixed-size features produced above
    x = torch::relu(lin1(x));
    x = lin2(x);
    return x;
}
Then I override the get_batch method and use forwardConLayer to compute every batch entry. In order to train correctly, I call zero_grad() before I construct a batch. All in all this looks like:
std::vector<ExampleType> get_batch(at::ArrayRef<size_t> indices) override {
    // based on the default implementation in base.h
    this->net.zero_grad();
    std::vector<ExampleType> batch;
    batch.reserve(indices.size());
    for (const auto i : indices) {
        ExampleType batchEntry = get(i);
        auto batchEntryData = (batchEntry.data).unsqueeze(0);               // add a batch dimension
        auto newBatchEntryData = this->net.forwardConLayer(batchEntryData); // fixed-size features
        batchEntry.data = newBatchEntryData;
        batch.push_back(batchEntry);
    }
    return batch;
}
Lastly, I call forwardDenseLayer in all places where I would normally call forward, e.g.:
for (torch::data::Example<> &batch : *dataLoader) {
    auto data = batch.data;
    auto target = batch.target.squeeze();
    auto output = model.forwardDenseLayer(data);
    auto loss = torch::mse_loss(output, target);
    LOG(INFO) << "Batch loss: " << loss.item<double>();
    loss.backward();
    optimizer.step();
}
Update
This solution seems to cause an error if the number of the dataloader's workers isn't 0. The error is:
terminate called after throwing an instance of 'std::runtime_error'
what(): one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [3, 12, 3, 3]] is at version 2; expected version 1 instead. ...
This error makes sense, because the data passes through the CNN's head during the batching process. The solution to this "problem" is to set the number of workers to 0.
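In code, that just means constructing the data loader from above with zero workers:

auto dataLoader = torch::data::make_data_loader(
    std::move(set),
    torch::data::DataLoaderOptions().batch_size(batchSize).workers(0) // workers must be 0 for this approach
);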

How update multiply descriptorSets at once

I have multiple render targets from my swapchain and create one descriptorSet per target. The idea is to create one commandBuffer per renderTarget and bind the corresponding descriptorSet. Now I want to update all descriptorSets with a single call to updateDescriptorSets, and tried:
std::vector<vk::WriteDescriptorSet> writes;
for( UINT32 index = 0; index < descriptorSets.size(); ++index )
{
    writes.push_back(
        vk::WriteDescriptorSet( descriptorSets[index], 0, 0, 1,
                                vk::DescriptorType::eStorageImage,
                                &vk::DescriptorImageInfo( vk::Sampler(), imageViews[index], vk::ImageLayout::eGeneral ),
                                nullptr,
                                nullptr ) );
}
device.updateDescriptorSets( writes.size(), writes.data(), 0, nullptr );
With this approach only the last renderTarget in the queue presents the wanted result. The others produce just black screens.
But when I call updateDescriptorSets multiple times, everything works as expected:
std::vector<vk::WriteDescriptorSet> writes;
for( UINT32 index = 0; index < descriptorSets.size(); ++index )
{
    writes.push_back(
        vk::WriteDescriptorSet( descriptorSets[index], 0, 0, 1,
                                vk::DescriptorType::eStorageImage,
                                &vk::DescriptorImageInfo( vk::Sampler(), imageViews[index], vk::ImageLayout::eGeneral ),
                                nullptr,
                                nullptr ) );
    device.updateDescriptorSets( writes.size(), writes.data(), 0, nullptr );
    writes.clear();
}
I thought I could update multiple descriptorSets at once. Is that not possible, or what else could be my error?
PS: I use the C++ headers from the Vulkan SDK.
The bug is indeed in the application code.
The problem lies in the creation of the vk::DescriptorImageInfo. In both code examples the struct only exists within the scope of the for loop, but only a pointer to it is copied into the vk::WriteDescriptorSet struct.
When updateDescriptorSets in the first example processes the data in the structs, the relevant data is already out of scope and the pointers are dangling. Coincidentally, the application reuses the same memory in every iteration of the loop, so all the pointers point to the same invalid location, where the data of the last loop iteration still exists. This is why only the last render target shows the expected result.
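A minimal sketch of one possible fix for the single-call version (same descriptorSets and imageViews as above): keep the vk::DescriptorImageInfo objects alive in a container until updateDescriptorSets has returned.

std::vector<vk::DescriptorImageInfo> imageInfos;
imageInfos.reserve( descriptorSets.size() ); // reserve so the pointers stored below stay valid
std::vector<vk::WriteDescriptorSet> writes;
for( UINT32 index = 0; index < descriptorSets.size(); ++index )
{
    imageInfos.push_back(
        vk::DescriptorImageInfo( vk::Sampler(), imageViews[index], vk::ImageLayout::eGeneral ) );
    writes.push_back(
        vk::WriteDescriptorSet( descriptorSets[index], 0, 0, 1,
                                vk::DescriptorType::eStorageImage,
                                &imageInfos.back(), // still alive when updateDescriptorSets runs
                                nullptr,
                                nullptr ) );
}
device.updateDescriptorSets( writes.size(), writes.data(), 0, nullptr );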

How to create an array of cl::sycl::buffers?

I am using Xilinx's triSYCL GitHub implementation, https://github.com/triSYCL/triSYCL.
I am trying to create a design with 100 producers/consumers that read/write from 100 pipes.
What I am not sure of is how to create an array of cl::sycl::buffer and initialize it using std::iota.
Here is my code:
constexpr size_t T = 6;
constexpr size_t n_threads = 100;

cl::sycl::buffer<float, n_threads> a { T };
for (int i = 0; i < n_threads; i++)
{
    auto ba = a[i].get_access<cl::sycl::access::mode::write>();
    // Initialize buffer a with increasing integer numbers starting at 0
    std::iota(ba.begin(), ba.end(), i * T);
}
And I am getting the following error:
error: no matching function for call to ‘cl::sycl::buffer<float, 2>::buffer(<brace-enclosed initializer list>)’
cl::sycl::buffer<float, n_threads> a { T };
I am new to C++ programming, so I am not able to figure out the exact way to do this.
There are two points that I think cause the issue you are having:
The 2nd template argument of the buffer definition should be the dimensionality of the buffer (the number of dimensions, i.e. 1, 2 or 3), not the dimensions themselves.
The constructor of the buffer should receive either the actual dimensions of the buffer, or the data that you want the buffer to hold together with those dimensions. To pass the dimensions, you need to pass a cl::sycl::range object to the constructor.
As I understand it, you are trying to create a buffer of dimensionality 1 holding n_threads (100) elements. To do this, the definition of a should change to:
cl::sycl::buffer < float, 1 > a(cl::sycl::range< 1 >(n_threads));
Also, since the dimensionality can be deduced from the range template parameter, you can achieve the same effect with:
cl::sycl::buffer< float > a (cl::sycl::range< 1 >(n_threads));
As for initializing the buffer with std::iota, you have 3 options:
Use an array to initialize the data with std::iota and pass it to the SYCL buffer (case A),
Use an accessor to write to the buffer directly, on the host / CPU only (case B), or
Use an accessor with a parallel_for, for execution on either the host or an OpenCL device (case C).
Accessors should not be used as iterators (with .begin(), .end()).
Case A:
std::vector<float> data(n_threads); // or std::array<float, n_threads> data;
std::iota(data.begin(), data.end(), 0); // this will create the data { 0, 1, 2, 3, ... }
cl::sycl::buffer<float> a(data.data(), cl::sycl::range<1>(n_threads));
// The data in a are already initialized, you can create an accessor to use them directly
Case B:
cl::sycl::buffer<float> a(cl::sycl::range<1>(n_threads));
{
    auto ba = a.get_access<cl::sycl::access::mode::write>();
    for (size_t i = 0; i < n_threads; i++) {
        ba[i] = i;
    }
}
Case C:
cl::sycl::buffer<float> a(cl::sycl::range<1>(n_threads));
cl::sycl::queue q{cl::sycl::default_selector()}; // create a command queue for host or device execution
q.submit([&](cl::sycl::handler& cgh) {
    // request write access to the buffer from within this command group
    auto ba = a.get_access<cl::sycl::access::mode::write>(cgh);
    cgh.parallel_for<class kernel_name>(cl::sycl::range<1>(n_threads), [=](cl::sycl::id<1> i) {
        ba[i] = i.get(0);
    });
});
q.wait_and_throw(); // wait until kernel execution completes
Also check chapter 4.8 of the SYCL 1.2.1 spec (https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf), as it has an example using iota.
Disclaimer: triSYCL is a research project for now. Please use ComputeCpp for anything serious. :-)
If you really need arrays of buffers, I guess you can use something similar to "Is there a way I can create an array of cl::sycl::pipe?"
As a variant, you can use a std::vector<cl::sycl::buffer<float>> or a std::array<cl::sycl::buffer<float>, n_threads> and initialize it with a loop from a cl::sycl::buffer<float> { T }, as sketched below.
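A minimal sketch of that std::vector variant, assuming triSYCL and the T/n_threads constants from the question (index-based writes are used instead of iterating over the accessor):

std::vector<cl::sycl::buffer<float>> buffers;
buffers.reserve(n_threads);
for (std::size_t i = 0; i < n_threads; ++i) {
    cl::sycl::buffer<float> b { T };  // one 1-D buffer of T elements per producer/consumer
    auto ba = b.get_access<cl::sycl::access::mode::write>();
    for (std::size_t j = 0; j < T; ++j)
        ba[j] = static_cast<float>(i * T + j);  // the same values std::iota would produce
    buffers.push_back(b);
}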

Setting pointer out of its memory range

I'm writing some code to do bitmap blending and my function has a lot of options. I decided to use a switch to handle those options, but then I needed to either put the switch inside a loop (which I read affects performance) or write a separate loop for each switch case (which makes the code way too big). I decided to take a third approach (see below):
/* When I need to use static value */
BYTE *pointerToValue = (BYTE*)&blendData.primaryValue;
BYTE **pointerToReference = &pointerToValue;
*pointerToReference = *pointerToReference - 3;

/* When I need srcLine's 4th value (where srcLine is a pointer to BYTE array) */
BYTE **pointerToReference = &srcLine;

while (destY2 < destY1) {
    destLine = destPixelArray + (destBytesPerLine * destY2++) + (destX1 * destInc);
    srcLine = srcPixelArray + (srcBytesPerLine * srcY2++) + (srcX1 * srcInc);
    for (LONG x = destX1; x < destX2; x++, destLine += destInc, srcLine += srcInc) {
        BYTE neededValue = *(*pointerToReference + 3); //not yet implemented
        destLine[0] = srcLine[0];
        destLine[1] = srcLine[1];
        destLine[2] = srcLine[2];
        if (diffInc == BOTH_ARE_32_BIT)
            destLine[3] = srcLine[3];
    }
}
Sometimes I need to use srcLine[3] and sometimes blendData.primaryValue. srcLine[3] can be accessed easily with *(*pointerToReference + 3); however, to access blendData.primaryValue I need to move the pointer back by 3 in order to keep the same expression (*(*pointerToReference + 3)).
So here are my questions:
1. Is it safe to set a pointer outside of its memory range if it is later going to be brought back?
2. I'm 100% sure that it won't be used while it's out of range, but can I be sure that it won't cause any kind of access violation?
3. Is there some similar alternative that uses one variable to capture the value of either srcLine[3] or blendData.primaryValue without an if(), like it's done in my code sample?
Because of #2 (no usage), the answer to #1 is yes, it is perfectly safe. And because of #1, there is no need for #3. :-)
An access violation could only happen if the pointer were actually used.
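That said, if you would rather not form the out-of-range pointer at all, one hypothetical variant of the question's own trick (not something the answer requires) is to keep the offset in a separate variable next to the double pointer, so the loop expression stays branch-free:

BYTE *pointerToValue = (BYTE*)&blendData.primaryValue; // points directly at the static value
BYTE **pointerToReference;  // set to &pointerToValue or &srcLine before the loop starts
ptrdiff_t valueOffset;      // set to 0 for the static value, or 3 for srcLine[3]

/* inside the inner loop */
BYTE neededValue = *(*pointerToReference + valueOffset);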

Grouping 2D array of instances

I have a working solution for this, though I am convinced there must be a better implementation. In a nutshell the problem is this:
I am working on a connect >= 3, bejewelled-style puzzle game. When the state of the 'board' changes I group all the pieces such that if they are 'connected' horizontally or vertically they share an ID. This is how I do it currently:
[pseudo]
for all( array object* )
{
    if !in_a_group() check_neighbours( this )
}

void check_neighbours( ID )
{
    if ( check_is_same_type( left ) )
    {
        if !in_a_group() this.ID = ID ; check_neighbours( ID )
        else if( in_a_group ) change_IDs(this.ID, ID )
    }
    same for right ...
    up ...
    down ...
}
That is a really dirty pseudo version of what I do.
I recursively call check_neighbours, passing the ID of the first branch piece forward (I use the this pointer as an ID rather than generating one).
If I find a connected piece with a different ID, I overwrite all pieces carrying that ID with the new ID (I have an ASSERT here because it shouldn't actually happen; it hasn't so far in lots of testing).
I don't call check_neighbours at the original branch unless the piece has no ID.
This works just fine, though my pseudo is probably missing some small logic.
My problem is that it has the potential to use many branches (which may be a problem on the hardware I am working on). I have worked on the problem so long now that I can't see another solution. Ever get the feeling you are missing something incredibly obvious?
Also, I am new to Stack Overflow and reasonably new to programming, so any advice on etiquette etc. is appreciated.
How would you suggest avoiding recursion?
As I understand it, your algorithm is basically a "flood fill" with a small twist.
Anyway, to avoid recursion you need to allocate an array to store the coordinates of unprocessed items and use it as a queue (FIFO). Because you know the dimensions of the grid (and since it is a bejeweled-style(?) game), you should be able to preallocate it pretty much anywhere.
Pseudocode for any flood-fill-type recursive algorithm:
struct Coord {
    int x, y;
};

typedef std::queue<Coord> CoordQueue;

bool validCoord(Coord coord) {
    return (coord.x >= 0) && (coord.y >= 0)
        && (coord.x < boardSizeX) && (coord.y < boardSizeY);
}

bool mustProcessCoord(Coord coord);
void processCoord(Coord coord);

void processAll() {
    CoordQueue coordQueue;
    Coord startPoint = {0, 0};
    coordQueue.push(startPoint);

    while (!coordQueue.empty()) {
        const Coord curCoord = coordQueue.front();

        // do something with the current coordinates.
        processCoord(curCoord);

        // visit the four direct neighbours (left, up, right, down)
        const int numOffsets = 4;
        const int offsets[numOffsets][2] = {{-1, 0}, {0, -1}, {1, 0}, {0, 1}};
        for (int offsetIndex = 0; offsetIndex < numOffsets; offsetIndex++) {
            Coord neighborCoord = {curCoord.x + offsets[offsetIndex][0],
                                   curCoord.y + offsets[offsetIndex][1]};
            if (!validCoord(neighborCoord) || !mustProcessCoord(neighborCoord))
                continue;
            coordQueue.push(neighborCoord);
        }

        coordQueue.pop();
    }
}
See? No recursion, just a loop. Pretty much any recursive function can be unwrapped into something like that.
If your underlying platform is restrictive and you have no std::queue, use an array instead (a ring buffer implemented over an array can act as a FIFO queue). Because you know the size of the board, you can precalculate the size of the array. The rest should be easy.
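A minimal sketch of that array-backed ring buffer, assuming a known upper bound on the board size (kMaxBoardCells is an illustrative constant, and Coord is the struct from above):

// Fixed-capacity FIFO over a plain array; no dynamic allocation needed.
enum { kMaxBoardCells = 16 * 16 }; // assumed upper bound: boardSizeX * boardSizeY
Coord ringBuffer[kMaxBoardCells];
int head = 0, tail = 0, count = 0;

void pushCoord(Coord c) {            // enqueue at the tail
    ringBuffer[tail] = c;
    tail = (tail + 1) % kMaxBoardCells;
    ++count;
}

Coord popCoord() {                   // dequeue from the head
    Coord c = ringBuffer[head];
    head = (head + 1) % kMaxBoardCells;
    --count;
    return c;
}

bool queueEmpty() { return count == 0; }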