gl_SubgroupInvocationID numbering in multi-dimensional work-groups

Regarding the GL_KHR_shader_subgroup extension and compute shaders: gl_SubgroupInvocationID is 1-dimensional (a single uint), while work-groups can be up to 3-dimensional. So I'm wondering which gl_LocalInvocationID values (threads within the work-group) end up in each subgroup.
For example: let's say I have a 16x16x1 work-group and gl_SubgroupSize is 32. How can I tell which threads of the current work-group are in subgroup 0 (gl_SubgroupID == 0)?

The partitioning of a work-group into subgroups is implementation-defined. Furthermore, you're not supposed to care about it.
The purpose of the subgroup functions is to facilitate communication between the invocations within a subgroup: to ask about values computed by other invocations relative to your own. The only functions that care about a specific "physical" relationship between invocations are the Quad functions, which deal with operations on 2x2 blocks within fragment shaders.
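For intuition only, since the spec guarantees none of this: GLSL does define gl_LocalInvocationIndex as the flattened local ID (index = z*Sx*Sy + y*Sx + x), and some implementations happen to carve subgroups out of consecutive ranges of that index. A sketch of that hypothetical packing for the 16x16x1 / size-32 example (plain C++ for illustration):

    #include <cstdio>

    int main() {
        const int Sx = 16, Sy = 16;   // work-group size 16x16x1 (from the question)
        const int subgroupSize = 32;  // gl_SubgroupSize (from the question)
        for (int y = 0; y < Sy; ++y)
            for (int x = 0; x < Sx; ++x) {
                int index = y * Sx + x;  // gl_LocalInvocationIndex with z = 0
                // Hypothetical linear packing; NOT guaranteed by the spec:
                if (index / subgroupSize == 0)
                    std::printf("local (%2d,%2d) would be in subgroup 0\n", x, y);
            }
    }

Under that packing, subgroup 0 would be local rows y = 0 and y = 1, but a driver is free to pick any other partitioning.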

Related

Is there a way to calculate dFdx(dFdx()) of something within a fragment shader?

So I already know that the documentation for dFdx, dFdy, and fwidth states that "expressions that imply higher-order derivatives such as dFdx(dFdx(n)) have undefined results, as do mixed-order derivatives such as dFdx(dFdy(n))." If such expressions are undefined, is it possible to get higher-order derivatives of some expression within a fragment shader?
I hear that dFdx gets information from neighboring fragments and finds the difference between the neighbor's values and this fragment's values. Perhaps there is a way to manually take information from neighboring fragments?
I think there is a formula that can be used to find a second-order (mixed partial) derivative:
(f(x+h,y+h) - f(x+h,y) - f(x,y+h) + f(x,y))/h^2
But my question is, how do we get the terms f(x+h,y+h), f(x+h,y), and f(x,y+h)? And how do we get h, the distance between fragments?
Perhaps there is a way to manually take information from neighboring fragments?
Even if you could (and with some subgroup extensions, you can), it wouldn't help.
Fragment shaders execute invocations in 2x2 quads: groups of 4 invocations that are directly adjacent to each other. The derivative functions merely take the difference between the data in the horizontally/vertically adjacent fragments of the quad. If one or more of the fragments in a quad happens to fall outside the area of the primitive being rasterized, it still gets executed (in order to compute derivatives), but it has no visible effects. These are called "helper invocations".
Regardless, invocations in a quad can only talk to other invocations in the same quad. And if you wanted higher-order derivatives, you would need to sample from more than just a single adjacent fragment.
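To make that last point concrete: even a plain second derivative along x needs three samples at distinct x positions, e.g. the standard central-difference stencil

    d2f/dx2 ≈ (f(x+h,y) - 2*f(x,y) + f(x-h,y))/h^2

while a 2x2 quad only ever spans two x positions (and two y positions), so no combination of quad-local differences can produce it. (As for h: the spacing between adjacent fragments is one pixel, which is why dFdx simply returns the raw difference.)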

About organizing threads in CUDA

General question: must the number of threads equal the number of elements I want to process? For example, if I have a matrix M[a][b], must I allocate exactly a*b threads, or can I allocate more threads than I need (more than a*b)? The thread that would handle element a*b+1 would read out of bounds, wouldn't it? Or is the solution to add a condition (process only if the index is in range of a*b)?
Specific question: let M[x][y] be a matrix with x rows and y columns, where 1000 <= x <= 300000 and y <= 100. How can I organize the threads so that the scheme works for any input sizes x and y, with each thread handling one element of the matrix? Compute capability is 2.1. Thanks!
General answer: it depends on the problem.
In most cases a natural one-to-one mapping of the problem onto the grid of threads is fine to start with, but you want to keep the following in mind:
Achieving high occupancy.
Maximizing GPU resources usage and memory throughput.
Working with valid data.
Sometimes this may require using a single thread to process many elements, or many threads to process a single element.
For instance, you can imagine a series of independent operations A, B, and C that need to be applied to an array of elements. You could run three different kernels, but it might be a better choice to allocate a grid containing three times more threads than there are elements and distinguish the operations by one of the dimensions of the grid (or anything else). On the other hand, you might have a problem that benefits from maximizing the use of shared memory (e.g. transforming an image): you could use a block of 16 threads to process a 5x5 image window, where each thread calculates some statistics of a 2x2 slice.
The choice is yours; the best advice is to not always go with the obvious. Try different approaches and choose what works best.
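To the asker's general question: yes, the standard pattern is to round the grid up and guard each thread with a bounds check, so surplus threads simply do nothing. A minimal sketch (the kernel name and per-element work are made up for illustration):

    // One thread per matrix element; M stored row-major as x*y floats.
    __global__ void processElements(float *M, int x, int y)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < x && col < y)        // threads outside the matrix just exit
            M[row * y + col] *= 2.0f;  // placeholder per-element work
    }

    // Launch: round the grid up so it covers the matrix for any x and y.
    dim3 block(16, 16);  // 256 threads per block, reasonable on CC 2.1
    dim3 grid((y + block.x - 1) / block.x,
              (x + block.y - 1) / block.y);
    processElements<<<grid, block>>>(d_M, x, y);

With y <= 100 and x up to 300000, this stays well within the CC 2.x grid-dimension limits.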

HLSL: get the number of thread groups and numthreads in code

My question concerns compute shaders, HLSL code in particular. DeviceContext.Dispatch(X, Y, Z) spawns X * Y * Z groups, each of which has x * y * z individual threads, as set in the attribute [numthreads(x,y,z)]. The question is: how can I get the total number of thread groups dispatched and the number of threads in a group? Let me explain why I want this: the amount of data I intend to process may vary significantly, so my methods should adapt to the size of the input arrays. Of course I can send the Dispatch arguments in a constant buffer to make them available from HLSL code, but what about the number of threads in a group? I am looking for methods like GetThreadGroupNumber() and GetThreadNumberInGroup(). I appreciate any help.
The number of threads in a group is simply the product of the numthreads dimensions. For example, numthreads(32,8,4) will have 32*8*4 = 1024 threads per group. This can be determined statically at compile time.
The ID of a particular thread group can be determined by adding a uint3 input argument with the SV_GroupID semantic.
The ID for a particular thread within a thread-group can be determined by adding a uint3 input argument with the SV_GroupThreadID semantic, or uint SV_GroupIndex if you prefer a flattened version.
As far as providing information to each thread on the total size of the dispatch, using a constant buffer is your best bet. This is analogous to the graphics pipeline, where the pixel shader doesn't naturally know the viewport dimensions.
It's also worth mentioning that if you do find yourself in a position where each thread needs to know the overall dispatch size, you should consider restructuring your algorithm. In general, it's better to dispatch a variable number of thread groups, each with a fixed amount of work, rather than dispatching a fixed number of threads with a variable amount of work. There are of course exceptions, but this will tend to provide better utilization of the hardware.
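A minimal host-side sketch of the constant-buffer pattern (C++ with Direct3D 11; buffer creation and the matching HLSL cbuffer are omitted, and the struct layout here is just an illustration):

    // Mirrored by a cbuffer in the HLSL so every thread can read it.
    struct DispatchInfo
    {
        UINT groupCountX, groupCountY, groupCountZ;
        UINT elementCount;  // total number of elements to process
    };

    const UINT threadsPerGroup = 64;  // must match [numthreads(64,1,1)] in the shader
    DispatchInfo info = {};
    info.groupCountX = (elementCount + threadsPerGroup - 1) / threadsPerGroup;
    info.groupCountY = 1;
    info.groupCountZ = 1;
    info.elementCount = elementCount;

    context->UpdateSubresource(dispatchInfoCB, 0, nullptr, &info, 0, 0);
    context->CSSetConstantBuffers(0, 1, &dispatchInfoCB);
    context->Dispatch(info.groupCountX, info.groupCountY, info.groupCountZ);

Each thread can then compare its flattened global index against elementCount and early-out, the usual bounds-check pattern for compute work.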

Usage of glBindBufferRange with transform feedback

I have a buffer that I would like to fill over successive transform feedbacks, and I am wondering how exactly to do this.
glBindBufferRange has five arguments; I understand that the first three are equivalent to the arguments of glBindBufferBase, but I have a few questions about the offset and size arguments.
If my first transform feedback produced n primitives, as retrieved from GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, my primitives are points, and I want to continue from that position in the buffer, should the offset of glBindBufferRange be set to n*4*sizeof(GLfloat)? (assuming I am retrieving a vec4 geometry shader output)
The docs just say that offset and size should be in basic machine units (although they have two different types, GLintptr and GLsizeiptr), but I'm not exactly sure what that means, so I assumed bytes; is this correct?
Yes: the amount of data written to a buffer during transform feedback is the number of primitives written * the number of components per primitive * the size of a component. And yes, "basic machine units" is standardese for "byte".
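Putting that together, a sketch of setting up the second pass (the query object and buffer names are illustrative):

    // The query was active around the first transform feedback pass:
    // glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, query); ... glEndQuery(...)
    GLuint primsWritten = 0;
    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &primsWritten);

    // Points with a single vec4 output: 4 floats per primitive.
    GLintptr offset = (GLintptr)primsWritten * 4 * sizeof(GLfloat);
    glBindBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0, tfBuffer,
                      offset, bufferSize - offset);  // both in bytes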

What is the most efficient (yet sufficiently flexible) way to store multi-dimensional variable-length data?

I would like to know the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays with variable length. The focus is on performance, but I also need to be able to handle changing the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important), or in a spectral element code like yours with large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you generally can't do much of the hard work on an element or block until a substantial amount of that block's data is already in cache; you're going to have had to flush most of block N's data out of cache before getting much work done on block N+1 anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse; e.g. elementwise operations on cell values.) So keeping the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want each block/element itself to be contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only will all those memory copies wipe your cache, the copies themselves get very expensive. If your problem fills a significant fraction of memory (and aren't we always?), say 1 GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's 0.2 GB (read + write) / ~20 GB/s ~ 20 ms, compared to reloading (say) 16k cache lines at ~100 ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew you were going to refine/derefine very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about contiguity. YMMV, of course. My suggestion would be, before spending too much time crafting the perfect data structure, to first just test the operation on two elements, twice: once with the elements in order, computing on them 1-2, and once doing the operation in the "wrong" order, 2-1, timing the two computations several times.
For each cell, store the offset at which that cell's data begins in one contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate into it, then throw away the old arrays. This storage is also much better for external analysis, because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be performed without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
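A minimal sketch of that layout (all names are illustrative):

    #include <cstddef>
    #include <vector>

    // One contiguous array for all values, plus a (K+1)-entry offset table:
    // cell k's values occupy data[offset[k]] .. data[offset[k+1]] - 1,
    // i.e. nVars * N_k^2 doubles, which may differ per cell.
    struct CellStorage
    {
        std::vector<std::size_t> offset;  // offset.back() == data.size()
        std::vector<double> data;
    };

    // An ordered traversal over all values of all cells is one linear sweep,
    // so mesh-agnostic operations (inner products, Runge-Kutta stages)
    // can walk 'data' directly.
    void scale(CellStorage &s, double a)
    {
        for (std::size_t k = 0; k + 1 < s.offset.size(); ++k)
            for (std::size_t i = s.offset[k]; i < s.offset[k + 1]; ++i)
                s.data[i] *= a;
    }

On refinement or coarsening, build the new offset/data arrays, interpolate the old values into them, and discard the old arrays, as described above.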