CUDA Optimizations for the RTX A5000 [closed] - c++

I am a bit rusty on my CUDA skills. I am attempting to generate a 5120x5120 image that consists entirely of random noise generated via cuRAND. I am using a single NVIDIA RTX A5000, whose compute capability is 8.6. My question is: what are the ideal grid dimensions and threads per block to squeeze the highest amount of efficiency out of this noise generation? For context, here are some of the hardware specs of the A5000:
- 64 SMs
- 8192 CUDA cores
- 128 cores per SM
- 16 blocks per SM
- 48 active warps per SM (1536 threads)

Both for general occupancy-based performance considerations and for cuRAND generator state initialization, you would want to choose a number of blocks and threads per block whose product is at most 64 SMs x 1536 threads per SM. Ideally you would target exactly that number.
You would start by writing the kernel that takes generator state that has already been initialized, and runs a grid-stride loop to use random number generation to write image points.
If the occupancy analysis (probably based on registers per thread) suggests that maximum thread load per SM is possible (for that kernel that you just wrote), then you would size your grids that way.
If the occupancy analysis indicated that your kernel cannot have the full 1536 threads per SM, then you would reduce your grid launch (size) accordingly.
Since the cc8.6 SM has a maximum of 1536 threads, don't size your threadblocks at 1024 threads: 1024 does not divide 1536 evenly, so a 1024-thread block would cap each SM at 1024 resident threads. Choose 512, or some other number that divides 1536 evenly, such as 256.
With CURAND, in my view, the best practice is to launch the generator state initialization kernel (separately, first) before your image update kernel. Don't try to do generator state initialization in the same kernel that is doing the image generation.
Once you figure out what the maximum occupancy is, then you will size your generator state array to match that, and you will launch a grid-stride kernel to fill that array as your CURAND init kernel.
Then launch your image creation kernel.
Here is a simple example:
$ cat t2115.cu
#include <curand_kernel.h>
#include <curand.h>

const int imageW = 5120;
const int imageH = 5120;
using mt = float;

// Initialize one generator state per thread slot, grid-striding over the state array.
__global__ void setup_kernel(curandState *state, size_t N, const unsigned long long seed = 1, const unsigned long long offset = 0){

  for (size_t id = blockIdx.x * blockDim.x + threadIdx.x; id < N; id += gridDim.x * blockDim.x)
    curand_init(seed, id, offset, state + id);
}

// Grid-stride loop: each thread loads its generator state once, then fills image points.
template <typename T>
__global__ void image_gen(T *img, curandState *state, const size_t img_size){

  size_t id = blockIdx.x * blockDim.x + threadIdx.x;
  curandState s = state[id];
  for (; id < img_size; id += gridDim.x * blockDim.x)
    img[id] = curand_uniform(&s);
}

int main(){

  size_t is = ((size_t)imageW) * imageH;
  int grid_dim = 64 * 1536;   // 64 SMs x 1536 threads per SM
  int bs = 512;               // threads per block
  int gs = grid_dim / bs;     // blocks
  mt *img;
  curandState *s;
  cudaMalloc(&s, sizeof(curandState) * grid_dim);
  cudaMallocManaged(&img, sizeof(mt) * is);
  setup_kernel<<<gs, bs>>>(s, grid_dim);
  image_gen<<<gs, bs>>>(img, s, is);
  cudaDeviceSynchronize();
}
$ nvcc -Xptxas=-v -arch=sm_86 -o t2115 t2115.cu
ptxas info : 218048 bytes gmem, 72 bytes cmem[3]
ptxas info : Compiling entry function '_Z9image_genIfEvPT_P17curandStateXORWOWm' for 'sm_86'
ptxas info : Function properties for _Z9image_genIfEvPT_P17curandStateXORWOWm
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWmyy' for 'sm_86'
ptxas info : Function properties for _Z12setup_kernelP17curandStateXORWOWmyy
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 384 bytes cmem[0]
$
We've compiled with the -Xptxas=-v switch, which causes the compiler to report register utilization. We see that the image_gen kernel uses 18 registers per thread. This is well below the limit that still allows full occupancy, so we should be fine with the indicated launch sizes and should expect full occupancy. The sm_86 SM has 65536 registers, so spread across 1536 threads this implies a limit of about 65536/1536 ≈ 42 registers per thread. If our kernel needed more than that, we would reduce the grid_dim variable accordingly.
This can be largely "automated" with the occupancy API and launch bounds functionality, but these basic ideas should be understood first.
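As a minimal sketch of that automation (not part of the original example), the occupancy API can be queried at runtime instead of hard-coding 64 SMs x 1536 threads. This assumes the kernels from t2115.cu above are in scope; the block count the runtime reports will vary with register usage, toolkit version, and compile flags:
#include <cuda_runtime.h>

  int bs = 512;            // chosen threads per block
  int blocks_per_sm = 0;
  // Ask the runtime how many bs-thread blocks of image_gen<float> can be resident per SM
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, image_gen<float>, bs, 0);
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  int gs = blocks_per_sm * prop.multiProcessorCount;  // one "full wave" of blocks
  int grid_dim = gs * bs;  // total threads; also the number of generator states to allocate
At full occupancy on the A5000 this works out to the same 64 x 1536 = 98304 threads used above, but it adapts automatically if a kernel's register usage lowers the achievable occupancy.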

Related

How can I obtain consistently high throughput in this loop?

In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking there is one gigantic array which is divided up into 16 word chunks, and I simply add up the number of leading zeroes of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

#define ARRAY_LEN 16

// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
        c0 += t0;
    }
    return c0;
}

// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
    clock_t t = clock();
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    t = clock() - t;
    // Convert clock time to milliseconds
    return t * 1e3 / (double)CLOCKS_PER_SEC;
}

void print_stats(double t_ms, long n, double total_MiB) {
    double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
    printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}

int main(int argc, char *argv[]) {
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
    uint64_t *buf = malloc(total_bytes);
    uint32_t *counts = malloc(sizeof(uint32_t) * n);
    double t_ms, total_MiB = total_bytes / (double)(1 << 20);
    printf("Total size: %.1f MiB\n", total_MiB);

    // Warm up
    t_ms = clz_arrays(counts, buf, n);
    //print_stats(t_ms, n, total_MiB); // (1)

    // Run it
    t_ms = clz_arrays(counts, buf, n); // (2)
    print_stats(t_ms, n, total_MiB);

    // Write something into buf
    for (long i = 0; i < n*ARRAY_LEN; ++i)
        buf[i] = i;

    // And again...
    (void) clz_arrays(counts, buf, n); // (3)
    t_ms = clz_arrays(counts, buf, n); // (4)
    print_stats(t_ms, n, total_MiB);

    free(counts);
    free(buf);
    return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz" which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles but it has a throughput of 1 operation per cycle (see Agner Fog's Skylake instruction tables) so, with 8-byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GiB/s, which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GiB/s.
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz").
Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
Any help would be most appreciated!
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory
Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.
So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer for more about this; it's much worse on many-core Xeon chips than on quad-core desktops/laptops.)
Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. If you were adding extra print_stats to show the warm-up run performance, that would have made the run after slower.
You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
clock() is not a great way to measure performance. It records time in seconds, not CPU core clock cycles. If you run your benchmark long enough, you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch-prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't because anything after the first clock() is inside the timed interval.
Something based on wall-clock time which can use RDTSC as a timesource instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
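As an illustrative sketch (not from the original answer), a timer along these lines could replace the clock() calls in clz_arrays; clock_gettime(CLOCK_MONOTONIC) is standard POSIX and on Linux normally goes through the vDSO, so it doesn't force a switch to kernel mode:
#include <time.h>

// Wall-clock milliseconds via the monotonic clock (vDSO/RDTSC path on Linux, no syscall).
static double wall_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec * 1e-6;
}

// Hypothetical usage inside clz_arrays, also looping several passes (NPASSES is made up)
// over the same buffer so page-walk and TLB-miss costs get amortized:
//   double t0 = wall_ms();
//   for (int pass = 0; pass < NPASSES; ++pass) { /* ...the clz loop... */ }
//   double t_ms = wall_ms() - t0;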
For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
But with memory-bound stuff, changing core frequency changes the ratio of core to memory, making memory faster or slower relative to the CPU.

Why is std::fill(0) slower than std::fill(1)?

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:
5.8 GiB/s vs 7.5 GiB/s
However, the results are different for smaller data sizes, where fill(0) is faster:
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
This raises the secondary question, why the peak bandwidth of fill(1) is so much lower.
The test system for this was a dual socket Intel Xeon CPU E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and the Intel compiler 17.0.1 (-fast); both give identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. Stream add with 24 threads gets 85 GiB/s on the system.
I was able to reproduce this effect on a different Haswell dual socket server system, but not any other architecture. For example on Sandy Bridge EP, memory performance is identical, while in cache fill(0) is much faster.
Here is the code to reproduce:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}
Presented results compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.
From your question + the compiler-generated asm from your answer:
fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that.
Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
Note that rep movsd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell.
Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only the b version is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.
gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep movsd for fill(1). Probably not, since I assume the code starts out as a loop, rather than a memset.
With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):
A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB uses a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page-walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever.)
@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.
movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)
Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).
As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).
movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line fill buffers, not on sending those line-full buffers to DRAM (until you have enough threads to saturate memory bandwidth).
@osgx posted some interesting links in comments:
Agner Fog's asm optimization guide, instruction tables, and microarch guide: http://agner.org/optimize/
Intel optimization guide: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.
NUMA snooping: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/
https://software.intel.com/en-us/articles/intelr-memory-latency-checker
Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture
See also other stuff in the x86 tag wiki.
I'll share my preliminary findings, in the hope to encourage more detailed answers. I just felt this would be too much as part of the question itself.
The compiler optimizes fill(0) into an internal memset. It cannot do the same for fill(1), since memset only works on bytes.
Specifically, both glibcs __memset_avx2 and __intel_avx_rep_memset are implemented with a single hot instruction:
rep stos %al,%es:(%rdi)
Whereas the manual loop compiles down to an actual 128-bit instruction:
add $0x1,%rax
add $0x10,%rdx
movaps %xmm0,-0x10(%rdx)
cmp %rax,%r8
ja 400f41
Interestingly, while there is a template/header optimization that implements std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop.
Strangely, for a std::vector<char>, gcc begins to optimize fill(1) as well. The Intel compiler does not, despite the memset template specialization.
Since this happens only when the code is actually working in memory rather than in cache, it makes it appear that the Haswell-EP architecture fails to efficiently consolidate the single-byte writes.
I would appreciate any further insight into the issue and the related micro-architecture details. In particular it is unclear to me why this behaves so differently for four or more threads and why memset is so much faster in cache.
Update:
Here is a result in comparison with
fill(1) that uses -march=native (avx2 vmovdq %ymm0) - it works better in L1, but similar to the movaps %xmm0 version for other memory levels.
Variants of 32, 128 and 256 bit non-temporal stores. They perform consistently with the same performance regardless of the data size. All outperform the other variants in memory, especially for small numbers of threads. 128 bit and 256 bit perform almost identically; for low numbers of threads, 32 bit performs significantly worse.
For <= 6 threads, vmovnt has a 2x advantage over rep stos when operating in memory.
Single threaded bandwidth:
Aggregate bandwidth in memory:
Here is the code used for the additional tests with their respective hot-loops:
void __attribute__ ((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}
┌─→add $0x1,%rax
│ vmovdq %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rdi,%rax
└──jb e0
void __attribute__ ((noinline)) fill1_nt_si32(vector& v) {
    for (auto& elem : v) {
        _mm_stream_si32(&elem, 1);
    }
}
┌─→movnti %ecx,(%rax)
│ add $0x4,%rax
│ cmp %rdx,%rax
└──jne 18
void __attribute__ ((noinline)) fill1_nt_si128(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m128i buf = _mm_set1_epi32(1);
    size_t i;
    int* data;
    int* end4 = &v[v.size() - (v.size() % 4)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end4; data += 4) {
        _mm_stream_si128((__m128i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %xmm0,(%rdx)
│ add $0x10,%rdx
│ cmp %rcx,%rdx
└──jb 40
void __attribute__ ((noinline)) fill1_nt_si256(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m256i buf = _mm256_set1_epi32(1);
    size_t i;
    int* data;
    int* end8 = &v[v.size() - (v.size() % 8)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end8; data += 8) {
        _mm256_stream_si256((__m256i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %ymm0,(%rdx)
│ add $0x20,%rdx
│ cmp %rcx,%rdx
└──jb 40
Note: I had to do manual pointer calculation in order to get the loops so compact. Otherwise it would do vector indexing within the loop, probably due to the intrinsic confusing the optimizer.

Does `mem_fence` provide consistency between work-groups?

I am trying to implement the bounding-box calculation as described here. Long story short, I have a binary tree of bounding boxes. The leaf nodes are all filled in, and now it is time to calculate the internal nodes. In addition to the nodes (each defining the child/parent indices), there is a counter for each internal node.
Starting at each leaf node, the parent node is visited and its flag atomically incremented. If this is the first visit to the node, the thread exits (as only one child is guaranteed to have been initialized). If it is the second visit, then both children are initialized, its bounding box is calculated and we continue with that node's parents.
Is the mem_fence between reading the flag and reading the data of its children sufficient to guarantee the data in the children will be visible?
kernel void internalBounds(global struct Bound * const bounds,
                           global unsigned int * const flags,
                           const global struct Node * const nodes) {
    const unsigned int n = get_global_size(0);
    const size_t D = 3;
    const size_t leaf_start = n - 1;
    size_t node_idx = leaf_start + get_global_id(0);
    do {
        node_idx = nodes[node_idx].parent;
        write_mem_fence(CLK_GLOBAL_MEM_FENCE);
        // Mark node as visited, both children initialized on second visit
        if (atomic_inc(&flags[node_idx]) < 1)
            break;
        read_mem_fence(CLK_GLOBAL_MEM_FENCE);
        const global unsigned int * child_idxs = nodes[node_idx].internal.children;
        for (size_t d = 0; d < D; d++) {
            bounds[node_idx].min[d] = min(bounds[child_idxs[0]].min[d],
                                          bounds[child_idxs[1]].min[d]);
            bounds[node_idx].max[d] = max(bounds[child_idxs[0]].max[d],
                                          bounds[child_idxs[1]].max[d]);
        }
    } while (node_idx != 0);
}
I am limited to OpenCL 1.2.
No, it doesn't. CLK_GLOBAL_MEM_FENCE only provides consistency within the work-group when accessing global memory. There is no inter-workgroup synchronization in OpenCL 1.x.
Try to use a single, large workgroup and iterate over the data. And/or start with some small trees that will fit inside a single work group.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/mem_fence.html
mem_fence(...) orders memory accesses for only a single work-item. Even if all work-items execute this line, they may not hit it (and continue past it) at the same time.
barrier(...) does synchronize all work-items in a work-group and makes them wait for the slowest one (the one still accessing the memory space given as a parameter), but it only applies to its own work-group's work-items (such as only 64 or 256 for AMD/Intel, and maybe 1024 for Nvidia). This is because an OpenCL device driver implementation may be designed to finish all wavefronts before loading new batches of wavefronts, since all global work-items simply would not fit inside on-chip memory (e.g. 64M work-items each using 1 kB of local memory would need 64 GB of memory; even software emulation would need hundreds or thousands of passes and decrease performance to the level of a single CPU core).
Global sync (where all work groups synchronized) is not possible.
Just in case the meanings of work-item, work-group and processing element get mixed up, see:
OpenCL: Work items, Processing elements, NDRange
The atomic function you put there is already accessing global memory, so adding group-scope synchronization shouldn't be important.
Also check the machine code to see whether
bounds[child_idxs[0]].min[d]
is pulling the whole bounds[child_idxs[0]] struct into private memory before accessing min[d]. If so, you can separate min into an independent array and access its items directly to get 100% more memory bandwidth for it.
Test on intel hd 400, more than 100000 threads
__kernel void fenceTest( __global float *c,
                         __global int *ctr)
{
    int id = get_global_id(0);
    if (id < 128000)
        for (int i = 0; i < 20000; i++)
        {
            c[id] += ctr[0];
            mem_fence(CLK_GLOBAL_MEM_FENCE);
        }
    ctr[0]++;
}
2900ms (c array has garbage)
__kernel void fenceTest( __global float *c,
                         __global int *ctr)
{
    int id = get_global_id(0);
    if (id < 128000)
        for (int i = 0; i < 20000; i++)
        {
            c[id] += ctr[0];
        }
    ctr[0]++;
}
500 ms (c array has garbage). That is ~6x the performance of the fence version (my laptop has single-channel 4GB RAM which is only 5-10 GB/s, but its iGPU's local memory has nearly 38 GB/s: 64B per cycle at 600 MHz). The local-fence version takes 700 ms, so it seems the fenceless version isn't even touching cache or local memory for some iterations.
Without the loop, it takes 8-9 ms, so I suppose it wasn't optimizing away the loop in these kernels.
Edit:
int id = get_global_id(0);
if (id == 0)
{
    atom_inc(&ctr[0]);
    mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id] += ctr[0];
behaves exactly as
int id = get_global_id(0);
if (id == 0)
{
    ctr[0]++;
    mem_fence(CLK_GLOBAL_MEM_FENCE);
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
c[id] += ctr[0];
for this Intel iGPU device (only by chance; it shows the changed memory is visible to "all" trailing threads, but it doesn't prove it always happens (for example, the first compute unit could hiccup and the second could start first), and it is not atomic when more than a single thread accesses it).

Performance optimization with different blocks and threads in CUDA

I've written a program to compute a histogram, where each of the 256 values for a char byte is counted:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "..\..\common\book.h"
#include <stdio.h>
#include <cuda.h>
#include <conio.h>
#define SIZE (100*1024*1024)
__global__ void histo_kernel(unsigned char *buffer, long size, unsigned int *histo){
__shared__ unsigned int temp[256];
temp[threadIdx.x] = 0;
__syncthreads();
int i = threadIdx.x + blockIdx.x * blockDim.x;
int offset = blockDim.x * gridDim.x;
while (i < size) {
atomicAdd(&temp[buffer[i]], 1);
i += offset;}
__syncthreads();
atomicAdd(&(histo[threadIdx.x]), temp[threadIdx.x]);
}
int main()
{
unsigned char *buffer = (unsigned char*)big_random_block(SIZE);
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
unsigned char *dev_buffer;
unsigned int *dev_histo;
cudaMalloc((void**)&dev_buffer, SIZE);
cudaMemcpy(dev_buffer, buffer, SIZE, cudaMemcpyHostToDevice);
cudaMalloc((void**)&dev_histo, 256 * sizeof(long));
cudaMemset(dev_histo, 0, 256 * sizeof(int));
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocks = prop.multiProcessorCount;
histo_kernel << <blocks * 256 , 256>> >(dev_buffer, SIZE, dev_histo);
unsigned int histo[256];
cudaMemcpy(&histo, dev_histo, 256 * sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsed_time;
cudaEventElapsedTime(&elapsed_time, start, stop);
printf("Time to generate: %f ms\n", elapsed_time);
long sum = 0;
for (int i = 0; i < 256; i++)
sum += histo[i];
printf("The sum is %ld", sum);
cudaFree(dev_buffer);
cudaFree(dev_histo);
free(buffer);
getch();
return 0;
}
I've read in the book, CUDA by Example, that launching the kernel with a number of blocks twice the number of multiprocessors is empirically found to be the most optimal solution. Yet, when I launch it with 8 times the number of blocks, the running time is cut down.
I've run the kernel with: 1.Blocks same as the number of multiprocessors, 2.Blocks twice the number of multiprocessors, 3.Blocks 4 times, and so on.
With (1), I got the running time to be 112ms
With (2) I got the running time to be 73ms
With (3) I got the running time to be 52ms
Funnily enough, once the number of blocks reached 8 times the number of multiprocessors, the running time did not vary by a significant amount; it was the same for blocks being 8 times, 256 times, and 1024 times the number of multiprocessors.
How can this be explained?
This behavior is typical. The GPU is a latency-hiding machine. In order to hide latency, when it hits a stall, it needs additional new work available. You can maximize the amount of additional new work available by giving the GPU a large number of blocks and threads.
Once you have given it enough work to hide latency as best it can, giving it additional work does not help. The machine is saturated. However, having additional work available is generally/typically not much of a detriment either. There is little overhead associated with blocks and threads.
Whatever you read in CUDA by Example may have been true for a specific case, but it is certainly not generally true that the correct number of blocks to launch is equal to twice the number of multiprocessors. A better target (typically) would be 4-8 blocks per multiprocessor.
When it comes to blocks and threads, more is usually better, and it's rarely the case that having arbitrarily large numbers of blocks and threads will actually cause a significant degradation in performance. This is contrary to typical CPU thread programming, where having large numbers of OMP threads, for example, may lead to a significant reduction in performance, when you exceed the core count.
When you are tuning the code for the last 10% in performance, then you will see people limit the amount of blocks they launch, to some number that is typically 4-8 times the number of SMs, and construct their threadblocks to loop over the data set. But this normally only yields a few percent performance improvement, in most cases. As a reasonable CUDA programming starting point, aim for tens of thousands of threads, and hundreds of blocks, at least. A carefully tuned code may be able to saturate the machine with fewer blocks and threads, but it will become GPU-dependent at that point. And as I've stated already, there's rarely much of a performance detriment to having millions of threads and thousands of blocks.
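As a hedged sketch (not from the original answer), here is one way the "4-8 blocks per multiprocessor" guideline could be applied to the launch in the question; it assumes the histo_kernel shown above, and the factor of 8 is just the empirical rule of thumb:
    // Derive the launch configuration from the device's SM count rather than a fixed constant.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int threads = 256;                           // one thread per histogram bin
    int blocks  = 8 * prop.multiProcessorCount;  // enough resident blocks to hide latency
    // The kernel's internal while loop strides over the whole buffer, so launching
    // fewer blocks than data elements is fine.
    histo_kernel<<<blocks, threads>>>(dev_buffer, SIZE, dev_histo);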

CUDA: Hitting the maximum possible value of blockIdx.x for compute capability 3.0

First I should say I'm quite new to programming in C++ (let alone CUDA), though it is what I first learned with about 184 years ago. I'd say I'm a bit out of touch with memory allocation, and datatype sizes, though I'm learning. Anyway here goes:
I have a GPU with compute capability 3.0 (It's a Geforce 660 GTX w/ 2GB of DRAM).
Going by ./deviceQuery found in the CUDA samples (and by other charts I've found online), my maximum grid size is listed:
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
At 2,147,483,647 (2^31-1) that x dimension is huge and kind of nice… YET, when I run my code, pushing beyond 65535 in the x dimension, things get... weird.
I used an example from an Udacity course, and modified it to test the extremes. I've kept the kernel code fairly simple to prove the point:
__global__ void referr(long int *d_out, long int *d_in){
    long int idx = blockIdx.x;
    d_out[idx] = idx;
}
Please note below that ARRAY_SIZE is the size of the grid, and also the size of the array of integers on which to do operations. I am leaving the size of the blocks at 1x1x1. JUST for the sake of understanding the limitations: I KNOW that having this many operations with blocks of only 1 thread makes no sense, but I want to understand what's going on with the grid size limitations.
int main(int argc, char ** argv) {
    const long int ARRAY_SIZE = 522744;
    const long int ARRAY_BYTES = ARRAY_SIZE * sizeof(long int);

    // generate the input array on the host
    long int h_in[ARRAY_SIZE];
    for (long int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = i;
    }
    long int h_out[ARRAY_SIZE];

    // declare GPU memory pointers
    long int *d_in;
    long int *d_out;

    // allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);

    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel with ARRAY_SIZE blocks in the x dimension, with 1 thread each.
    referr<<<ARRAY_SIZE, 1>>>(d_out, d_in);

    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // print out the resulting array
    for (long int i = 0; i < ARRAY_SIZE; i++) {
        printf("%li", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
This works as expected with an ARRAY_SIZE of at MOST 65535. The last few lines of the output are below:
65516 65517 65518 65519
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534
If I push the ARRAY_SIZE beyond this, the output gets really unpredictable and eventually, if the number gets too high, I get a Segmentation fault (core dumped) message... whatever that even means. For example, with an ARRAY_SIZE of 65536:
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534 131071
Why is it now stating that the blockIdx.x for this last one is 131071?? That is 65535+65535+1. Weird.
Even weirder, when I set the ARRAY_SIZE to 65537 (65535+2) I get some seriously strange results for the last lines of the output.
65520 65521 65522 65523
65524 65525 65526 65527
65528 65529 65530 65531
65532 65533 65534 131071
131072 131073 131074 131075
131076 131077 131078 131079
131080 131081 131082 131083
131084 131085 131086 131087
131088 131089 131090 131091
131092 131093 131094 131095
131096 131097 131098 131099
131100 131101 131102 131103
131104 131105 131106 131107
131108 131109 131110 131111
131112 131113 131114 131115
131116 131117 131118 131119
131120 131121 131122 131123
131124 131125 131126 131127
131128 131129 131130 131131
131132 131133 131134 131135
131136 131137 131138 131139
131140 131141 131142 131143
131144 131145 131146 131147
131148 131149 131150 131151
131152 131153 131154 131155
131156 131157 131158 131159
131160 131161 131162 131163
131164 131165 131166 131167
131168 131169 131170 131171
131172 131173 131174 131175
131176 131177 131178 131179
131180 131181 131182 131183
131184 131185 131186 131187
131188 131189 131190 131191
131192 131193 131194 131195
131196 131197 131198 131199
131200
Isn't 65535 the limit for older GPUs? Why is my GPU "messing up" when I push past the 65535 barrier for the x grid dimension? Or is this by design? What in the world is going on?
Wow, sorry for the long question.
Any help to understand this would be greatly appreciated! Thanks!
You should be using proper CUDA error checking. And you should be compiling for a compute 3.0 architecture by specifying -arch=sm_30 when you compile with nvcc.
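As an illustrative sketch (not part of the original answer), an error-checking pattern along these lines would surface launch problems instead of leaving garbage in the output; the gpuErrchk name is just a common convention, not an official CUDA API:
#include <cstdio>
#include <cstdlib>

// Minimal CUDA error-checking sketch: wrap every runtime call and check after kernel launches.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Usage (hypothetical placement in the code above):
//   gpuErrchk(cudaMalloc((void**) &d_in, ARRAY_BYTES));
//   referr<<<ARRAY_SIZE, 1>>>(d_out, d_in);
//   gpuErrchk(cudaGetLastError());        // catches an invalid launch configuration
//   gpuErrchk(cudaDeviceSynchronize());   // catches errors during kernel execution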