Is there an equivalent to gl_LocalInvocationIndex in an HLSL compute shader?

Or do I need to calculate this myself? I can't find a reference for built-in global variables in HLSL compute shaders.

This should be SV_GroupIndex, which, as described on MSDN, is:
The "flattened" index of a compute shader thread within a thread group, which turns the multi-dimensional SV_GroupThreadID into a 1D value. SV_GroupIndex varies from 0 to (numthreadsX * numthreadsY * numThreadsZ) – 1.
SV_GroupIndex = SV_GroupThreadID.z * dimx * dimy +
                SV_GroupThreadID.y * dimx +
                SV_GroupThreadID.x
MSDN Documentation Link
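For reference, a minimal HLSL sketch of how the semantic is received (the 8 x 8 x 1 thread-group size here is just an example, not something from the question):

[numthreads(8, 8, 1)]
void CSMain(uint3 groupThreadID : SV_GroupThreadID,
            uint  groupIndex    : SV_GroupIndex)
{
    // groupIndex == groupThreadID.z * 8 * 8 + groupThreadID.y * 8 + groupThreadID.x,
    // i.e. the same flattened value that gl_LocalInvocationIndex gives you in GLSL.
    // A common use is indexing groupshared memory.
}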

Related

Compute shader and workGroup

I want to understand how to work with compute shaders. I couldn't find many details on the Internet. What is a workgroup?
layout (local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
What does this mean?
vkCmdDispatch(cmdBuffer, 1, 1, 1);
Should the values in the shader and in the function call be the same?
For understanding these basic concepts of compute shaders, material for OpenCL, OpenGL, Metal, D3D, and CUDA compute would also be relevant: they all use a similar hierarchical grid subdivision of work.
The hierarchy, from finest to coarsest, in Vulkan terms is: invocation (aka thread) > subgroup > local workgroup > global workgroup (aka dispatch). Subgroups are a more advanced topic; you can ignore them for now as they're mostly implicit. Just to be confusing, people often just say "workgroup" when they mean "local workgroup".
The layout(local_size) declaration in your shader defines the dimensions of a local workgroup in terms of individual invocations. The parameters to vkCmdDispatch give the dimensions of the global workgroup, in terms of local workgroups.
So if you call vkCmdDispatch(cmdbuf, M, N, P) and the compute shader in the current pipeline declared layout (local_size_x=X, local_size_y=Y, local_size_z=Z), then Vulkan will run MxNxP local workgroups, each of which consists of XxYxZ invocations of your shader.
Within each invocation you can find out where it is within the local and global grids with the GLSL built-in input variables gl_NumWorkGroups, gl_WorkGroupID, gl_LocalInvocationID, gl_GlobalInvocationID, and gl_LocalInvocationIndex.
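As a concrete sketch of how the two declarations fit together (the 8 x 8 x 1 local size and the M, N, P counts are just placeholders):

// compute shader: one local workgroup = 8 x 8 x 1 = 64 invocations
layout (local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

void main()
{
    uvec3 group  = gl_WorkGroupID;          // which local workgroup this invocation is in
    uvec3 local  = gl_LocalInvocationID;    // position inside the local workgroup (0..7, 0..7, 0)
    uvec3 global = gl_GlobalInvocationID;   // gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
}

// host side: launches M x N x P local workgroups, i.e. (M*8) x (N*8) x (P*1) invocations in total
vkCmdDispatch(cmdBuffer, M, N, P);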

How can I access the size of a Compute Shader's local work group from the CPU?

Given a compute shader where I have set the local size of each dimension to the values x, y and z, is there any way for me to access that information from the C++ code? i.e.,
// Pseudo code, C++
int size[3];
size = get local sizes from linked compute shader
print(size);
//GLSL Code
layout (local_size_x = a number, local_size_y = a number, local_size_z = a number) in;
After looking around, I found the following on Khronos.org, on its page about glGetProgramiv:
https://www.khronos.org/registry/OpenGL-Refpages/es3/html/glGetProgramiv.xhtml
GL_COMPUTE_WORK_GROUP_SIZE
params returns an array of three integers containing the local work group size of the compute program as specified by its input layout qualifier(s). program must be the name of a program object that has been previously linked successfully and contains a binary for the compute shader stage.
This makes the line I needed
glGetProgramiv(ComputeShaderID, GL_COMPUTE_WORK_GROUP_SIZE, localWorkGroupSize);
where localWorkGroupSize is an array of 3 integers.
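Put together, a minimal sketch (assuming a current OpenGL context and that ComputeShaderID names a successfully linked compute program) looks like this:

// query the local_size_x/y/z declared in the compute shader's layout qualifier
GLint localWorkGroupSize[3] = { 0, 0, 0 };
glGetProgramiv(ComputeShaderID, GL_COMPUTE_WORK_GROUP_SIZE, localWorkGroupSize);
printf("local work group size: %d x %d x %d\n",
       localWorkGroupSize[0], localWorkGroupSize[1], localWorkGroupSize[2]);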

Manual depth rendering: Random results despite using atomic operations

I'm rendering single-pixel points into a uint32 texture with a compute shader. The texture is a 3D texture: x and y are viewport coordinates, while z holds depth information at coordinate 0 and additional attributes at 1. So, two manually built render targets, if you will. The code looks like this:
layout (r32ui, binding = 0) coherent volatile uniform uimage3D renderBuffer;
layout (rgba32f, binding = 1) restrict readonly uniform imageBuffer pointBuffer;
for(int j = 0; j < numPoints / gl_WorkGroupSize.x + 1; j++)
{
    vec4 point = imageLoad(pointBuffer, ...)
    // ... transform point ...
    uint originalDepth = imageAtomicMin(renderBuffer, ivec3(imageCoords, 0), point.depth);
    if (originalDepth >= point.depth)
    {
        // write happened, store the attributes
        imageStore(renderBuffer, ivec3(imageCoords, 1), point.attributes);
    }
}
While the depth values are correct, I have a few pixels where the attributes flicker between two values.
The order of points in the pointBuffer is random (but I've verified the set of all points is always the same), so my first thought was that two equal depth values might change the output depending on which one comes first. So I changed it so that, if originalDepth == point.depth, it uses imageAtomicMax to always have the same one of the two alternative attributes written, but that changed nothing.
I scattered barrier() and memoryBarrier() all over the place, but that changed nothing. I also removed all diverging control flow for this; that changed nothing either.
Reducing the local work size to 32 removes 90% of the flickering, but some still remains.
Any ideas would be greatly appreciated.
Edit: Before you ask why I do this manually instead of using normal rasterization and fragment shaders, the reason is performance. The rasterizer does not help since I'm rendering single-pixel points, shared memory greatly sped things up, and I render each point multiple times, which previously required a geometry shader, which was slow.
The problem is this: you have a race condition on writing to renderBuffer. If two different CS invocations map to the same pixel, and both of them decide to write the value, then there is a race on your imageStore call. One may overwrite the other, it may be a partial overwrite, or something else entirely. But in any case, it's not guaranteed to work.
This would be best solved by doing what rasterizers do: break the process down into two separate phases. The first phase does the ... transform point ... part, writing that data out to a buffer. The second phase then goes through the points and writes them to the final image.
In phase 2, each CS invocation performs all of the processing for a particular output pixel. That way, there are no race conditions. Of course, that requires that phase 1 produces data in a way that can be ordered per-pixel.
There are several ways to go about the latter. You could use a linked list, with a list per pixel. Or you could use a list per workgroup, where a workgroup represents some X/Y region of pixel space. In that case, you would use local shared memory as your local depth buffer, with all CS invocations reading from/writing to that region. After they all get done processing pixels, you write it out to real memory. Basically, you'd be implementing tile-based rendering manually.
Indeed, if you have a lot of these points, a tile-based solution would allow you to incorporate pipelining, so that you don't have to wait until all of phase 1 is done before starting on some of phase 2. You could break phase 1 down into chunks. You start a couple of phase 1 chunks, then a phase 2 chunk that reads from the first phase 1, then another phase 1, and so forth.
Vulkan, with its event system, has better tools for building such an efficient dependency chain than OpenGL does.
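To make the per-pixel variant concrete, here is a rough phase-2 sketch (the buffer names, node layout, and workgroup size are hypothetical, not something from your code): phase 1 would append each transformed point to a per-pixel linked list, and phase 2 gives each invocation exactly one pixel, so only that invocation decides which attributes win and there is no race:

#version 430
layout (local_size_x = 16, local_size_y = 16) in;

layout (r32ui, binding = 0) uniform uimage2D headPointers;   // per-pixel list heads written in phase 1
layout (std430, binding = 1) buffer Nodes {
    uvec3 nodes[];   // x = depth, y = packed attributes, z = next node index (0xFFFFFFFFu = end of list)
};
layout (r32ui, binding = 2) writeonly uniform uimage3D renderBuffer;

void main()
{
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    uint  node  = imageLoad(headPointers, pixel).x;

    uint bestDepth = 0xFFFFFFFFu;
    uint bestAttr  = 0u;
    while (node != 0xFFFFFFFFu) {      // walk this pixel's list
        uvec3 n = nodes[node];
        if (n.x < bestDepth) {         // keep the nearest point
            bestDepth = n.x;
            bestAttr  = n.y;
        }
        node = n.z;
    }

    imageStore(renderBuffer, ivec3(pixel, 0), uvec4(bestDepth));
    imageStore(renderBuffer, ivec3(pixel, 1), uvec4(bestAttr));
}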

Explanation of dFdx

I am trying to understand the dFdx() and dFdy() functions in GLSL.
I understand the following:
The derivative is the rate of change
The partial derivative of a function with two parameters is when you differentiate the function while keeping one of the parameters constant.
dFdx() and dFdy() find the rate that a value changes between the current fragment and a neighboring fragment.
I don't understand what the rate of change is referring to. Is it the rate of change of fragment coordinates?
Could it possibly be that you can find the rate of change of an arbitrary variable between two invocations of the fragment shader? Are the shader invocations "reading" variables from neighboring invocations? For a (simplistic) example:
// invocation for fragment 1
float x = 1.0;
float d = dFdx(x);
// invocation for fragment next to fragment 1 along the x axis.
float x = 2.0;
float d = dFdx(x);
Would d be -1.0 and 1.0 respectively?
To understand how these instructions work, it helps to understand the basic execution architecture of GPUs and how fragment programs map to that architecture.
GPUs run a bunch of threads in 'lock-step' over the same program, with each thread having its own set of registers. So the GPU fetches an instruction, then executes that instruction N times, once for each running thread. To deal with conditional branches and such, they also have an 'active mask' for the currently running group of threads. Threads that are not active in the mask don't actually run (so their registers don't change). Whenever there is a conditional branch or join (branch target) the thread mask is changed appropriately.
Now when a fragment program is run, the fragments to be run are arranged into "quads" -- 2x2 squares of 4 pixels that always run together in a thread group. Each thread in the group knows its own pixel coordinate, and can easily find the coordinate of the adjacent pixel in the quad by flipping the lowest bit of the x (or y) coord.
When the GPU executes a DDX or DDY instruction, what happens is that it peeks at the registers for the thread for the adjacent pixel and does a subtract with the value from the current pixel -- subtracting the value at the lower coordinate (lowest bit 0) from the value at the higher coordinate (lowest bit 1).
This has implications if you use dFdx or dFdy in a conditional branch -- if one of the threads in a quad is active while the other is not, the GPU will still look at the register of the inactive thread, which might have any old value in it, so the result could be anything.
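To answer the concrete question: both invocations in a quad pair get the same subtraction result, so in your hypothetical example d would be 1.0 (2.0 - 1.0) in both fragments, not -1.0 and 1.0. A small illustration of typical results (vUV is a hypothetical interpolated texture coordinate):

// in a fragment shader
float dx  = dFdx(gl_FragCoord.x);   // typically 1.0: window x grows by one pixel per quad column
vec2  duv = dFdx(vUV);              // how much the interpolated UV changes from one pixel to the next in x;
                                    // this is essentially what the hardware uses to pick mipmap levels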

OpenGL Pixel Shader: how to generate random matrix of 0s and 1s (on each pixel)?

So what I need is simple: each time we run our shader (meaning on each pixel) I need to calculate a random matrix of 1s and 0s with resolution == originalImageResolution. How can I do such a thing?
For now I have created one for Shadertoy. The random matrix resolution is set to 15 by 15 here, because the GPU often makes Chrome crash when I try something like 200 by 200, while what I really need is the full image resolution:
#ifdef GL_ES
precision highp float;
#endif
uniform vec2 resolution;
uniform float time;
uniform sampler2D tex0;
float rand(vec2 co){
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * (43758.5453 + time));
}

vec3 getOne(){
    vec2 p = gl_FragCoord.xy / resolution.xy;
    vec3 one;
    for(int i = 0; i < 15; i++){
        for(int j = 0; j < 15; j++){
            if(rand(p) <= 0.5)
                one = (one.xyz + texture2D(tex0, vec2(j, i)).xyz) / 2.0;
        }
    }
    return one;
}

void main(void)
{
    gl_FragColor = vec4(getOne(), 1.0);
}
And one for Adobe Pixel Bender:
<languageVersion: 1.0;>

kernel random
<   namespace   : "Random";
    vendor      : "Kabumbus";
    version     : 3;
    description : "not as random as needed, not as fast as needed"; >
{
    input image4 src;
    output float4 outputColor;

    float rand(float2 co, float2 co2){
        return fract(sin(dot(co.xy, float2(12.9898, 78.233))) * (43758.5453 + (co2.x + co2.y)));
    }

    float4 getOne(){
        float4 one;
        float2 r = outCoord();
        for(int i = 0; i < 200; i++){
            for(int j = 0; j < 200; j++){
                if(rand(r, float2(i, j)) >= 1.0)
                    one = (one + sampleLinear(src, float2(j, i))) / 2.0;
            }
        }
        return one;
    }

    void evaluatePixel()
    {
        float4 oc = getOne();
        outputColor = oc;
    }
}
So my real problem is: my shaders make my GPU driver crash. How can I use GLSL for the same purpose as I do now, but without crashing and, if possible, faster?
Update:
What I want to create is called a Single-Pixel Camera (google Compressive Imaging or Compressive Sensing); I want to create a GPU-based software implementation.
The idea is simple:
we have an NxM image.
for each pixel in the image we want the GPU to perform the following operations:
generate an NxM matrix of random values - 0s and 1s.
compute the arithmetic mean of all pixels of the original image whose coordinates correspond to the coordinates of 1s in our random NxM matrix
output the result of the arithmetic mean as the pixel color.
What I tried to implement in my shaders was a simulation of that very process.
What is really stupid about trying to do this on the GPU:
Compressive Sensing does not tell us to compute the full NxM matrix of such arithmetic-mean values; it needs just a piece of it (for example 1/3). So I am putting pressure on the GPU that I do not need to. However, testing on more data is not always a bad idea.
Thanks for adding more detail to clarify your question. My comments were getting too long, so I'm turning them into an answer. Moving the comments here to keep them together:
Sorry to be slow, but I am trying to understand the problem and the goal. In your GLSL sample, I don't see a matrix being generated. I see a single vec3 being generated by summing a random selection (varying over time) of cells from a 15 x 15 texture (matrix). And that vec3 is recomputed for each pixel. Then the vec3 is used as the pixel color.
So I'm not clear whether you really want to create a matrix, or just want to compute a value for every pixel. The latter is in some sense a 'matrix', but computing a simple random value for 200 x 200 pixels would not strain your graphics driver. Also you said you wanted to use the matrix. So I don't think that's what you mean.
I'm trying to understand why you want a matrix - to preserve a consistent random basis for all the pixels? If so, you can either precompute a random texture, or use a consistent pseudorandom function like you have in rand() except not use time. You clearly know about that so I guess I still don't understand the goal. Why are you summing a random selection of cells from the texture, for each pixel?
I believe the reason your shader is crashing is that your main() function is exceeding its time limit - either for a single pixel, or for the whole set of pixels. Calling rand() 40,000 times per pixel (in a 200 * 200 nested loop) could certainly explain that!
If you had 200 x 200 pixels, and are calling sin() 40k times for each one, that's 1,600,000,000 calls per frame. Poor GPU!
I'm hopeful that if we understand the goal better, we'll be able to recommend a more efficient way to get the effect you want.
Update.
(Deleted this part, since it was mistaken. Even though many cells in the source matrix may each contribute less than a visually detectable amount of color to the result, the total of the many cells can contribute a visually detectable amount of color.)
New update based on updated question.
OK, (thinking "out loud" here so you can check whether I'm understanding correctly...) Since you need each of the random NxM values only once, there is no actual requirement to store them in a matrix; the values can simply be computed on demand and then thrown away. That's why your example code above does not actually generate a matrix.
This means we cannot get away from generating (NxM)^2 random values per frame; that is, NxM random values per pixel, with NxM pixels. So for N = M = 200, that's 1.6 billion random values per frame.
However, we can still optimize some things.
First, since your random values only need to be one bit each (you only need a boolean answer to decide whether to include each cell from the source texture into the mix), you can probably use a cheaper pseudo random number generator. The one you're using outputs much more random data per call than one bit. For example, you could call the same PRNG function as you're using now, but store the value and extract 32 random bits out of it. Or at least several, depending on how many are random enough. In addition, instead of using a sin() function, if you have extension GL_EXT_gpu_shader4 (for bitwise operators), you could use something like this:
int LFSR_Rand_Gen(in int n)
{
    // <<, ^ and & require GL_EXT_gpu_shader4.
    n = (n << 13) ^ n;
    return (n * (n*n*15731 + 789221) + 1376312589) & 0x7fffffff;
}
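For example, a hypothetical sketch of pulling many boolean decisions out of a single call instead of calling the generator once per cell (seed here stands for whatever per-pixel integer seed you derive):

int bits = LFSR_Rand_Gen(seed);                   // one call, 31 usable random bits (top bit is masked off)
for (int b = 0; b < 31; b++)
{
    bool includeCell = ((bits >> b) & 1) == 1;    // one random 0/1 decision per bit
    // ... use includeCell for the next cell of the loop, reseed after 31 cells ...
}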
Second, you are currently performing one divide operation per included cell (/2.0), which is probably relatively expensive, unless the compiler and GPU are able to optimize it into a bit shift (is that possible for floating point?). This also will not give the arithmetic mean of the input values, as discussed above... it will put much more weight on the later values and very little on the earlier ones. As a solution, keep a count of how many values are being included, and divide by that count once, after the loop is finished.
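A sketch of that accumulation fix applied to the GLSL loop from the question (normalizing the texture coordinate and varying the rand() seed per cell are my assumptions, not part of the original code):

vec3  sum   = vec3(0.0);
float count = 0.0;
for (int i = 0; i < 15; i++) {
    for (int j = 0; j < 15; j++) {
        if (rand(p + vec2(i, j)) <= 0.5) {
            // accumulate instead of averaging as you go
            sum   += texture2D(tex0, vec2(j, i) / resolution.xy).xyz;
            count += 1.0;
        }
    }
}
vec3 one = (count > 0.0) ? sum / count : vec3(0.0);   // a single divide after the loop gives the true mean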
Whether these optimizations will be enough for your GPU driver to handle 200x200 pixels, each mixing from a 200x200 matrix, per frame, I don't know. They should definitely let you increase your resolution substantially.
Those are the ideas that occur to me off the top of my head. I am far from being a GPU expert though. It would be great if someone more qualified can chime in with suggestions.
P.S. In your comment, you jokingly (?) mentioned the option of precomputing N*M NxM random matrices. Maybe that's not a bad idea?? 40,000 x 40,000 is a big texture (around 200 MB even at one bit per cell), but if you store 32 bits of random data per cell, that comes down to 1250 x 40,000 cells. Too bad vanilla GLSL doesn't help you with bitwise operators to extract the data, but even if you don't have the GL_EXT_gpu_shader4 extension you can still fake it. (Maybe you would also need a special extension then for non-square textures?)