How to provide computed number of workgroups to glDispatchComputeIndirect? - opengl

How is glDispatchComputeIndirect supposed to work if you want to calculate the number of threads of the second dispatch in the first one?
I have a compute shader that runs over a buffer, checks if the value of the element is valid, and then conditionally writes the index of the element into another buffer with the help of an atomic counter. How can I now dispatch a second compute shader with one thread for each written index most efficiently?
The probably slowest solution is to read back the value n of the atomic counter and glDispatchCompute(n / workgroupSize, 1, 1).
I thought about using glDispatchComputeIndirect and preparing the dispatch indirect buffer in the first compute shader. But the values in the dispatch indirect buffer are expected to be integer numbers of workgroups, not numbers of threads, so I cannot simply increment an atomic counter per written element. I could dispatch another compute shader with one thread that only divides the number of written elements by the workgroup size, but that's not a proper solution.
I could also still use the atomic "element counter" for counting of written elements, check the return value in each thread and increment another atomic "workgroup counter" whenever the return value of atomicAdd is divisible by the workgroup size. This saves me a return trip to the CPU and a third dispatch, but at the cost of another atomic counter. But I cannot think of any better solution right now.

You don't need the number of "threads". You need the number of workgroups. So calculate the thing you need to calculate.
The relationship between the number of workgroups in the second dispatch call and the number of "threads" you compute is simple: threadCount / threadPerGroup, rounded up, where threadPerGroup is the number of invocations in a workgroup of the second compute shader.
You don't need the final value of threadCount to compute this. All you have to do is bump a second atomic counter every time threadCount is incremented from a multiple of threadPerGroup, because that is exactly when a new workgroup becomes necessary. That is easy to detect, since atomicCounterIncrement returns the previous value of the atomic counter.
So your code would look like this:
if(<I should add a thread>)
{
    uint oldThreadCount = atomicCounterIncrement(threadCount); //Returns old value
    if(oldThreadCount % threadPerGroup == 0) //That means `threadCount` is now in the next group.
        atomicCounterIncrement(groupCount);
}
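To check that the second counter really ends up as the rounded-up group count, here is a minimal CPU sketch in C++ rather than GLSL. It assumes std::atomic as a stand-in for the shader's atomic counters; count_groups, the "negative means invalid" test and the workers parameter are hypothetical names for illustration only. The struct mirrors the three unsigned integers that glDispatchComputeIndirect reads from the indirect buffer.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// The three values glDispatchComputeIndirect reads from the bound buffer.
struct DispatchIndirectCommand {
    std::uint32_t num_groups_x, num_groups_y, num_groups_z;
};

// CPU stand-in for the first compute pass: every "invocation" that accepts
// its element bumps threadCount, and the one that opens a new group also
// bumps groupCount.
inline DispatchIndirectCommand count_groups(const std::vector<int>& data,
                                            std::uint32_t threadPerGroup,
                                            unsigned workers = 4) {
    assert(threadPerGroup > 0 && workers > 0);
    std::atomic<std::uint32_t> threadCount{0};
    std::atomic<std::uint32_t> groupCount{0};

    auto pass = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) {
            if (data[i] < 0) continue;  // hypothetical validity test
            std::uint32_t old = threadCount.fetch_add(1, std::memory_order_relaxed);
            if (old % threadPerGroup == 0)  // this element starts a new group
                groupCount.fetch_add(1, std::memory_order_relaxed);
        }
    };

    std::vector<std::thread> pool;
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t b = std::min(data.size(), w * chunk);
        const std::size_t e = std::min(data.size(), b + chunk);
        pool.emplace_back(pass, b, e);
    }
    for (auto& t : pool) t.join();

    // y and z stay 1 for a one-dimensional dispatch.
    return {groupCount.load(), 1u, 1u};
}
```

In the real shader, the y and z fields of the indirect buffer would have to be set to 1 up front, and the second shader would still need a bounds check against threadCount because the last group is usually only partly filled. I would also expect a glMemoryBarrier(GL_COMMAND_BARRIER_BIT) to be needed between the two dispatches.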

race condition using OpenMP atomic capture operation for 3D histogram of particles and making an index

I have a piece of code in my full code:
const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
int ix,jy,kz;
int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    int ic=GetIndex(particles[i]);
    #pragma omp atomic update
    Length[ic]++;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    if(particles[i].ip==3)
    {
        int ic=GetIndex(particles[i]);
        if(ic>=cub3 || ic<0) printf("ic=%d out of range!\n",ic);
        int cnt=0;
        #pragma omp atomic capture
        cnt=Counter[ic]++;
        MIndex[Begin[ic]+cnt]=i;
    }
}
If I remove
#pragma omp parallel for
the code works properly and the output is always the same.
But with this pragma there is some undefined behaviour or race condition in the code, because it gives different output on every run.
How can I fix this issue?
Update: The task is the following. There are many particles with random coordinates. For each cell of the coordinate system (a Cartesian cube of, for example, 1×1×1 cm), I need to write to the array MIndex the indices, within the array particles, of the particles located in that cell. So MIndex should start with the indices of the particles in the 1st cell, followed by those in the 2nd cell, then the 3rd, and so on. The order of the indices within a given cell's region of MIndex is not important and may be arbitrary. If possible, this should be done in parallel, perhaps using atomic operations.
There is a straightforward way: traverse all the coordinate cells in parallel and, for each cell, check the coordinates of all the particles. But for a large number of cells and particles this seems to be slow. Is there a faster approach? Is it possible to traverse the particles array only once, in parallel, and fill the MIndex array using atomic operations, something like the code above?
You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters which would be a disaster, see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.
Keep per-thread count arrays from the initial histogram, use in 2nd pass
(Update: this might not work: I didn't notice the if(particles[i].ip==3) in the source before. I was assuming that Counter[ic] will go as high as Length[ic] in the serial version. If that's not the case, this strategy might leave gaps or something.
But as Laci points out, perhaps you want that check when calculating Length in the first place, then it would be fine.)
Manually multi-thread the first histogram (into Length[]), with each thread working on a known range of i values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].
So Length[thread][ic] is the number of particles in that cube, out of the range of i values that this thread worked on. (And will loop over again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)
Pre-process that into a per-thread Begin[][] array, so each thread knows where in MIndex to put data from each bucket.
// pseudo-code, fairly close to actual C
for (int ic = 0; ic < cub3; ic++) {
    // perhaps do this "vertical" sum into a temporary array
    // or prefix-sum within Length before combining across threads?
    // Bucket ic starts where bucket ic-1 ended; bucket 0 starts at 0.
    int pos = (ic == 0) ? 0 : Begin[0][ic-1] + sum(Length[0..nthreads-1][ic-1]);
    Begin[0][ic] = pos;
    for (int t = 1; t < nthreads; t++) {
        pos += Length[t-1][ic];  // prefix-sum across threads for this cube bucket
        Begin[t][ic] = pos;      // thread t starts after the points of threads 0..t-1
    }
}
This has a pretty terrible cache access pattern, especially with cuba=8 making Length[t][0] and Length[t+1][0] 16384 bytes apart from each other (cub3 = 16*16*16 ints). (That is a multiple of 4096, so 4k aliasing is a possible problem, as are cache conflict misses.)
Perhaps each thread can prefix-sum its own slice of Length into that slice of Begin, 1. for cache access pattern (and locality since it just wrote those Lengths), and 2. to get some parallelism for that work.
Then in the final loop with MIndex, each thread can do int pos = --Length[t][ic] to derive a unique ID from the Length. (Like you were doing with Counter[], but without introducing another per-thread array to zero.)
Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic] positions, MIndex[...] = i stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.
Don't overdo it with the number of threads, especially if cuba is greater than 8. The amount of Length / Begin pre-processing work scales with the number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8 meaning each per-thread array is only 16 KiB (too small to parallelize the zeroing of, BTW), it's really not that much.
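The strategy above could be sketched roughly like this in C++. It is only a sketch under some assumptions: it uses std::thread instead of OpenMP so that the thread-to-slice mapping is explicit, it takes a precomputed cell id per particle (what GetIndex() returns) and it assumes any filter such as ip==3 has been applied before both passes. The names Buckets, bucket_by_cell and run_on_threads are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Buckets {
    std::vector<int> begin;   // cell c owns mindex[begin[c] .. begin[c+1])
    std::vector<int> mindex;  // particle indices grouped by cell
};

// Runs body(t) on nthreads threads and waits for all of them.
template <class Body>
void run_on_threads(int nthreads, Body body) {
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(body, t);
    for (auto& th : pool) th.join();
}

inline Buckets bucket_by_cell(const std::vector<int>& cell, int num_cells, int nthreads) {
    assert(num_cells > 0 && nthreads > 0);
    const std::size_t n = cell.size();
    // Thread t owns particles [lo(t), lo(t+1)) in BOTH passes.
    auto lo = [&](int t) { return n * static_cast<std::size_t>(t) / nthreads; };

    std::vector<std::vector<int>> length(nthreads, std::vector<int>(num_cells, 0));
    std::vector<std::vector<int>> start(nthreads, std::vector<int>(num_cells, 0));

    // Pass 1: private histograms, no atomics needed.
    run_on_threads(nthreads, [&](int t) {
        for (std::size_t i = lo(t); i < lo(t + 1); ++i) ++length[t][cell[i]];
    });

    // Serial prefix sum: cell by cell, and thread by thread within a cell.
    Buckets out;
    out.begin.assign(num_cells + 1, 0);
    out.mindex.assign(n, -1);
    int pos = 0;
    for (int c = 0; c < num_cells; ++c) {
        out.begin[c] = pos;
        for (int t = 0; t < nthreads; ++t) {
            start[t][c] = pos;
            pos += length[t][c];
        }
    }
    out.begin[num_cells] = pos;

    // Pass 2: same slices; counting length[t][c] back down to zero hands out
    // unique slots inside the range reserved for (thread t, cell c).
    run_on_threads(nthreads, [&](int t) {
        for (std::size_t i = lo(t); i < lo(t + 1); ++i) {
            const int c = cell[i];
            const int k = --length[t][c];
            out.mindex[start[t][c] + k] = static_cast<int>(i);
        }
    });
    return out;
}
```

The prefix sum is done serially here for clarity; with many threads or cells that is the part you would want to restructure, as discussed above.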
(Previous answer before your edit made it clearer what was going on.)
Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.
After your update, you are doing a histogram into Length[], so that part can be sped up.
Atomic RMWs would be necessary for your code as-is, performance disaster
Atomic increments of shared counters would be slower, and on x86 might destroy the memory-level parallelism too badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens, and blocking later loads from starting until after it happens.
As opposed to a single thread which can have cache misses to multiple Counter, Begin and MIndex elements outstanding, using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]++ can be doing the load while there are cache misses outstanding for Begin[ic] and/or for MIndex[] stores.)
ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)
Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter accesses are scattered enough that cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem, too.)
Perhaps software prefetch can help, like prefetchw (write-prefetching) the counter for 5 or 10 i iterations later.
It wouldn't be deterministic which point went in which order, even with memory_order_seq_cst increments, though. Whichever thread increments Counter[ic] first is the one that associates that cnt with that i.
Alternative access patterns
Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. So the set of Counter[] elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.
Filtering by p.kz ranges maybe makes the most sense since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[].
But if your points aren't uniformly distributed, you'd need to know how to break things up to approximately equally divide the work. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work.
Maybe a couple fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
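As a rough illustration of the "each thread owns a contiguous range of buckets" idea, here is a C++ sketch. It assumes the cell id of every particle and the prefix-summed Begin array (with one extra end entry) are already available; scatter_owned_ranges is a hypothetical name, and std::thread is used instead of OpenMP only to keep the ownership explicit.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Every thread scans all cell ids but only handles cells in its own
// contiguous range, so counter[] and mindex[] are written without atomics.
// begin holds num_cells + 1 offsets: cell c owns mindex[begin[c] .. begin[c+1]).
inline std::vector<int> scatter_owned_ranges(const std::vector<int>& cell,
                                             const std::vector<int>& begin,
                                             int nthreads) {
    assert(nthreads > 0 && begin.size() >= 2);
    const int num_cells = static_cast<int>(begin.size()) - 1;
    std::vector<int> counter(num_cells, 0);
    std::vector<int> mindex(cell.size(), -1);

    auto work = [&](int t) {
        const int first = num_cells * t / nthreads;       // owned cells: [first, last)
        const int last = num_cells * (t + 1) / nthreads;
        for (std::size_t i = 0; i < cell.size(); ++i) {
            const int c = cell[i];
            if (c < first || c >= last) continue;          // another thread's cell
            mindex[begin[c] + counter[c]++] = static_cast<int>(i);
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(work, t);
    for (auto& th : pool) th.join();
    return mindex;
}
```

A side effect of this layout is that the result is deterministic: within each cell the particle indices come out in ascending order, because only one thread ever fills that cell. The price is that every thread reads the whole particle array.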
Re: your edit
You already loop over the whole array of points doing Length[ic]++;? Seems redundant to do the same histogramming work again with Counter[ic]++;, but not obvious how to avoid it.
The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with different count arrays for each thread, and just vertically adding at the end. Should scale perfectly with threads since the count array is small enough for L1d cache.
BTW, for() Length[i]=Counter[i]=0; is too small to be worth parallelizing. For cuba=8, it's 16*16*16 * sizeof(int) = 16384 bytes, just four pages, so it's just two small memsets.
(Of course if each thread has their own separate Length array, they each need to zero it). That's small enough to even consider unrolling with maybe 2 count arrays per thread to hide store/reload serial dependencies if a long sequence of points are all in the same bucket. Combining count arrays at the end is a job for #pragma omp simd or just normal auto-vectorization with gcc -O3 -march=native since it's integer work.
For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down from --Length[i], and another counting up from 0 in Counter[i]++. With different threads looking at different points, this could give you a factor of 2 speedup. (Modulo contention for MIndex stores.)
To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top.
You are right to make the update Counter[ic]++ atomic, but there is an additional problem on the next line: MIndex[Begin[ic]+cnt]=i; Different iterations can write into the same location here, unless you have a mathematical proof from the structure of MIndex that this is never the case. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speedup is probably going to be abysmal.
EDIT: the second line, however, is not of the right form for an atomic operation, so you have to make it critical, which is going to make performance even worse.
Also, @Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this cannot be parallelized.

Which memory barriers are minimally needed for updating array elements with greater values?

What would be the minimally needed memory barriers in the following scenario?
Several threads update the elements of an array int a[n] in parallel.
All elements are initially set to zero.
Each thread computes a new value for each element; then,
it compares the computed new value to the existing value stored in the array,
and writes the new value only if it is greater than the stored value.
For example, if a thread computes for a[0] a new value 5, but
a[0] is already 10, then the thread should not update a[0].
But if the thread computes a new value 10, and a[0] is 5,
then the thread must update a[0].
The computation of the new values involves some shared read-only data;
it does not involve the array at all.
While the above-mentioned threads are running, no other thread accesses the array.
The array is consumed later, after all the threads are guaranteed to finish their updates.
The implementation uses a compare-and-swap loop, wrapping the elements
into atomic_ref (either from Boost or from C++20):
for (int k = 0; k != n; ++k) // For each element of the array
{
    // Locally in this thread, compute the new value for a[k].
    int new_value = ComputeTheNewValue(k);
    // Establish atomic access to a[k].
    atomic_ref<int> memory(a[k]);
    // [Barrier_0]: Read the existing value.
    int existing_value = memory.load(memory_order_relaxed);
    while (true) // The compare-and-swap loop.
    {
        // Overwrite only with higher values.
        if (new_value <= existing_value)
            break;
        // Need to update a[k] with the higher value "new_value", but
        // only if a[k] still stores the "existing_value".
        if (memory.compare_exchange_weak(existing_value, new_value,
                /*Barrier_1*/ memory_order_relaxed,
                /*Barrier_2*/ memory_order_relaxed))
        {
            // a[k] was still storing "existing_value", and it has been
            // successfully updated with the higher "new_value".
            // We're done, and we may exit the compare-and-swap loop.
            break;
        }
        else
        {
            // We get here in two cases:
            // 1. a[k] was found to store a value different from "existing_value", or
            // 2. the compare-and-swap operation has failed spuriously.
            // In the first case, the new value stored in a[k] has been loaded
            // by compare_exchange_weak() function into the "existing_value" variable.
            // Then, we need to compare the "new_value" produced by this thread
            // with the newly loaded "existing_value". This is achieved by simply continuing the loop.
            // The second case (the spurious failure) is also handled by continuing the loop,
            // although in that case the "new_value <= existing_value" comparison is redundant.
            continue;
        }
    }
}
This code involves three memory barriers:
Barrier_0 in memory.load().
Barrier_1, to use in read-modify-write when compare_exchange_weak() succeeds.
Barrier_2, to use in load operation when compare_exchange_weak() fails.
In this scenario, is the code guaranteed to update only with higher values
when all three barriers are set to relaxed?
If not, what minimal barriers are needed to guarantee the correct behavior?
Relaxed is fine; you don't need any ordering wrt. accesses to any other elements during the process of updating. And for accesses to the same location, ISO C++ guarantees that a "modification order" exists for each location separately, and that even relaxed operations will only see the same or later values in the modification order of a location that the thread has already loaded or RMWed.
You're just building an atomic fetch_max primitive out of a CAS retry loop. Since the other writers are doing the same thing, the value of each location is monotonically increasing. So it's totally safe to bail out any time you see a value greater than the new_value.
For the main thread to collect the results at the end, you do need release/acquire synchronization like thread.join or some kind of flag. (e.g. maybe fetch_sub(1, release) of a counter of how many threads still have work left to do, or an array of done flags so you can just do a pure store.)
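To make that concrete, here is a small sketch of the same "store if greater" loop with everything relaxed, with thread join as the only synchronization before the results are read. It assumes std::atomic<int> elements instead of the question's atomic_ref so that it builds as C++11; atomic_store_max, parallel_max and the formula that generates candidate values are hypothetical stand-ins for ComputeTheNewValue().

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Relaxed "store if greater", i.e. a fetch_max built from a CAS retry loop.
inline void atomic_store_max(std::atomic<int>& slot, int new_value) {
    int existing = slot.load(std::memory_order_relaxed);
    while (new_value > existing &&
           !slot.compare_exchange_weak(existing, new_value,
                                       std::memory_order_relaxed,
                                       std::memory_order_relaxed)) {
        // On failure (real or spurious) existing holds the current value;
        // the loop condition re-tests it before retrying.
    }
}

// Hypothetical workload: thread t proposes (k * 7 + t * 13) % 101 for a[k].
inline std::vector<int> parallel_max(int n, int nthreads) {
    assert(n >= 0 && nthreads > 0);
    std::vector<std::atomic<int>> a(n);
    for (auto& x : a) x.store(0, std::memory_order_relaxed);

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([&a, n, t] {
            for (int k = 0; k < n; ++k)
                atomic_store_max(a[k], (k * 7 + t * 13) % 101);
        });
    // join() synchronizes with the end of each thread, so the loads below
    // see the final values even though every atomic operation was relaxed.
    for (auto& th : pool) th.join();

    std::vector<int> out(n);
    for (int k = 0; k < n; ++k) out[k] = a[k].load(std::memory_order_relaxed);
    return out;
}
```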
BTW, this seems likely to be slow, with lots of time spent waiting for cache lines to bounce between cores. (Lots of false-sharing.) Ideally you can efficiently change this to have each thread work on different parts of the array (e.g. computing multiple candidates for the same index so it doesn't need any atomic stuff).
I cannot guarantee that the computed indices do not overlap. In practice, the overlapping is usually small, but it cannot be eliminated.
So apparently that's a no. And if the indices touched by different threads are in different cache lines (chunk of 16 int32_t) then there won't be too much false sharing. (Also, if computation is expensive so you aren't producing values very fast, that's good so atomic updates aren't what your code is spending most of its time on.)
But if there is significant contention and the array isn't huge, you could give each thread its own output array, and collect the results at the end. e.g. have one thread do a[i] = max(a[i], b[i], c[i], d[i]) for 4 to 8 arrays per loop. (Not too many read streams at once, and not a variable number of inputs because that probably couldn't compile efficiently). This should benefit from SIMD, e.g. SSE4.1 pmaxsd doing 4 parallel max operations, so this should be limited mostly by L3 cache bandwidth.
Or divide the max work between threads as a second parallel phase, with each thread doing the above over part of the output array. Or have the thread_id % 4 == 0 reduce results from itself and the next 3 threads, so you have a tree of reductions if you have a system with many threads.

OpenMp: Best way to create an array with size of the number of threads

Specifically, I have to calculate pi in parallel with OpenMP. I am only allowed to use #pragma omp parallel. So I wanted to create an array with a size equal to the number of threads, compute the partial sums in parallel, and then add the partial sums together. But unfortunately it seems impossible to get the number of threads before the parallel region. So is the best way to create a very large array, initialize it with 0.0, and then add everything together, or is there a better way? I would appreciate any answer. Thank you in advance!
Fortunately, it is not impossible to obtain the number of threads in advance. The OpenMP runtime does not simply launch a random number of threads without any control from either the programmer or the program user. On the contrary, it follows a well-defined mechanism to determine that number, which is described in detail in the OpenMP specification. Specifically, unless you've supplied a higher fixed number of threads with the num_threads clause, the number of threads OpenMP launches is limited by the value of the special internal control variable (ICV for short) called nthreads-var. The way to set this ICV is via the OMP_NUM_THREADS environment variable or via the omp_set_num_threads() call (the latter method overrides the former). The value of nthreads-var is accessible by calling omp_get_max_threads(). For other ICVs see the specification.
All you need to do is call omp_get_max_threads() and use the return value as the size of your array, for the number of threads will not exceed that value, given that you aren't calling omp_set_num_threads() with a larger value afterwards and aren't applying the num_threads clause to the parallel construct.
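A minimal sketch of that approach, using only #pragma omp parallel and a manual split of the iterations, might look like the following. The function name compute_pi and the midpoint-rule integration of 4/(1+x^2) are my assumptions about the exercise, not something taken from the question. The _OPENMP guards only exist so that the code also builds (serially) when OpenMP is not enabled.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

inline double compute_pi(long steps) {
    assert(steps > 0);
#ifdef _OPENMP
    const int max_threads = omp_get_max_threads();  // upper bound for the team size
#else
    const int max_threads = 1;  // built without OpenMP: the pragma is ignored
#endif
    std::vector<double> partial(max_threads, 0.0);  // one slot per possible thread
    const double h = 1.0 / static_cast<double>(steps);

    #pragma omp parallel
    {
#ifdef _OPENMP
        const int tid = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();  // actual team size
#else
        const int tid = 0, nthreads = 1;
#endif
        double sum = 0.0;  // accumulate locally, write to the shared array once
        for (long i = tid; i < steps; i += nthreads) {
            const double x = (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        partial[tid] = sum * h;
    }

    double pi = 0.0;
    for (double p : partial) pi += p;  // slots of threads that never ran stay 0.0
    return pi;
}
```

Accumulating into a local variable and writing each slot only once also keeps the threads from fighting over the cache line that holds the array.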

HLSL Get number of threadGroups and numthreads in code

My question concerns compute shaders, HLSL code in particular. DeviceContext.Dispatch(X, Y, Z) spawns X * Y * Z groups, each of which has x * y * z individual threads, as set in the attribute [numthreads(x,y,z)]. The question is, how can I get the total number of thread groups dispatched and the number of threads in a group? Let me explain why I want this: the amount of data I intend to process may vary significantly, so my methods should adapt to the size of the input arrays. Of course I can send the Dispatch arguments in a constant buffer to make them available from HLSL code, but what about the number of threads in a group? I am looking for methods like GetThreadGroupNumber() and GetThreadNumberInGroup(). I appreciate any help.
The number of threads in a group is simply the product of the numthreads dimensions. For example, numthreads(32,8,4) will have 32*8*4 = 1024 threads per group. This can be determined statically at compile time.
The ID for a particular thread-group can be determined by adding a uint3 input argument with the SV_GroupID semantic.
The ID for a particular thread within a thread-group can be determined by adding a uint3 input argument with the SV_GroupThreadID semantic, or uint SV_GroupIndex if you prefer a flattened version.
As far as providing information to each thread on the total size of the dispatch, using a constant buffer is your best bet. This is analogous to the graphics pipeline, where the pixel shader doesn't naturally know the viewport dimensions.
It's also worth mentioning that if you do find yourself in a position where each thread needs to know the overall dispatch size, you should consider restructuring your algorithm. In general, it's better to dispatch a variable number of thread groups, each with a fixed amount of work, rather than dispatching a fixed number of threads with a variable amount of work. There are of course exceptions, but this will tend to provide better utilization of the hardware.
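On the host side, "a variable number of groups with a fixed amount of work each" usually comes down to a rounded-up division. The sketch below shows that arithmetic in C++. DispatchParams, make_dispatch_params and the group size of 64 are hypothetical; the struct is only meant to suggest what could be placed in a constant buffer, not an actual Direct3D type.

```cpp
#include <cassert>
#include <cstdint>

// Must match the hypothetical shader's [numthreads(64, 1, 1)].
constexpr std::uint32_t kThreadsPerGroup = 64;

// Hypothetical constant-buffer payload.
struct DispatchParams {
    std::uint32_t element_count;  // real item count; threads with an id >= this return early
    std::uint32_t group_count_x;  // the X passed to Dispatch(X, 1, 1)
};

inline DispatchParams make_dispatch_params(std::uint32_t element_count) {
    // Round up so that a partly filled last group still covers the tail.
    const std::uint32_t groups =
        element_count / kThreadsPerGroup +
        (element_count % kThreadsPerGroup != 0 ? 1u : 0u);
    return {element_count, groups};
}
```

Inside the shader, each thread would compare its flattened ID against element_count and do nothing when it falls in the padding of the last group.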

Filling counting 'buckets' in CUDA threads

In my program, I'm tracking a large number of particles through a voxel grid. The ratio of particles to voxels is arbitrary. At a certain point, I need to know which particles lie in which voxels, and how many do. Specifically, the voxels must know exactly which particles are contained within them. Since I can't use anything like std::vector in CUDA, I'm using the following algorithm (at the high level):
1. Allocate an array of ints the size of the number of voxels
2. Launch threads for all the particles, determine the voxel each one lies in, and increase the appropriate counter in my 'bucket' array
3. Allocate an array of pointers the size of the number of particles
4. Calculate each voxel's offset into this new array (summing the number of particles in the voxels preceding it)
5. Place the particles in the array in an ordered fashion (I use this data to accelerate an operation later on. The speed increase is well worth the increased memory usage).
This breaks down on the second step though. I haven't been programming in CUDA for long, and just found out that simultaneous writes among threads to the same location in global memory produce undefined results. This is reflected in the fact that I mostly get 1's in buckets, with the occasional 2. Here's a sketch of the code I'm using for this step:
__global__ void GPU_AssignParticles(Particle* particles, Voxel* voxels, int* buckets) {
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if(tid < num_particles) { // <-- you can assume I actually passed this to the function :)
        // Some math to determine the index of the voxel which this particle
        // resides in.
        buckets[index] += 1;
    }
}
My question is, what's the proper way to generate these counts in CUDA?
Also, is there a way to store references to the particles within the voxels? The issue I see there is that the number of particles within a voxel constantly changes, so new arrays would have to be deallocated and reallocated almost every frame.
Although there may be more efficient solutions for calculating the bucket counts, a first working solution is to use your current approach, but using an atomic increment. This way only one thread at a time increments the bucket count atomically (synchronized over the whole grid):
if(tid < num_particles) {
    // ...
    atomicAdd(&buckets[index], 1);
}
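For reference, the whole count / offset / scatter sequence from the question can be prototyped on the CPU before porting it to a kernel. The following is a sketch in C++, not CUDA: std::atomic's fetch_add plays the role of atomicAdd, and VoxelLists, build_voxel_lists, parallel_for_each_index and the precomputed voxel_of array are hypothetical names for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct VoxelLists {
    std::vector<int> offset;    // voxel v owns particle[offset[v] .. offset[v+1])
    std::vector<int> particle;  // particle indices grouped by voxel
};

// Runs body(i) for every i in [0, n), striding the indices across threads.
template <class Body>
void parallel_for_each_index(std::size_t n, int nthreads, Body body) {
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back([=] {
            for (std::size_t i = static_cast<std::size_t>(t); i < n;
                 i += static_cast<std::size_t>(nthreads))
                body(i);
        });
    for (auto& th : pool) th.join();
}

inline VoxelLists build_voxel_lists(const std::vector<int>& voxel_of,
                                    int num_voxels, int nthreads) {
    assert(num_voxels > 0 && nthreads > 0);
    const std::size_t n = voxel_of.size();
    std::vector<std::atomic<int>> buckets(num_voxels);
    for (auto& b : buckets) b.store(0, std::memory_order_relaxed);

    // Count particles per voxel; the atomic increment cannot lose updates.
    parallel_for_each_index(n, nthreads, [&](std::size_t i) {
        buckets[voxel_of[i]].fetch_add(1, std::memory_order_relaxed);
    });

    // Exclusive prefix sum gives each voxel's offset; reset buckets to reuse
    // them as per-voxel write cursors.
    VoxelLists out;
    out.offset.assign(num_voxels + 1, 0);
    for (int v = 0; v < num_voxels; ++v)
        out.offset[v + 1] = out.offset[v] + buckets[v].exchange(0, std::memory_order_relaxed);

    // Scatter: each particle claims the next free slot of its voxel.
    out.particle.assign(n, -1);
    parallel_for_each_index(n, nthreads, [&](std::size_t i) {
        const int v = voxel_of[i];
        const int slot = out.offset[v] + buckets[v].fetch_add(1, std::memory_order_relaxed);
        out.particle[slot] = static_cast<int>(i);
    });
    return out;
}
```

Because the offsets are recomputed from the counts every time, nothing has to be deallocated or reallocated per voxel when particles move; the two flat arrays can simply be refilled each frame. The order of particles inside a voxel depends on thread timing.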