CUDA concurrent kernel launch not working - c++

I'm writing a CUDA program for image processing. The same kernel, processOneChannel, is launched for each of the RGB channels.
Below I try to specify streams for the three kernel launches so they can be processed concurrently. But nvprof says they are still launched one after another...
There are two other kernels before and after these three, and I don't want them to run concurrently.
Basically I want the following:
separateChannels --> processOneChannel(x3) --> recombineChannels
Please advise what I did wrong.
void kernelLauncher(const ushort4 * const h_inputImageRGBA, ushort4 * const d_inputImageRGBA,
                    ushort4* const d_outputImageRGBA, const size_t numRows, const size_t numCols,
                    unsigned short *d_redProcessed,
                    unsigned short *d_greenProcessed,
                    unsigned short *d_blueProcessed,
                    unsigned short *d_prand)
{
    int MAXTHREADSx = 512;
    int MAXTHREADSy = 1;
    int nBlockX = numCols / MAXTHREADSx + 1;
    int nBlockY = numRows / MAXTHREADSy + 1;
    const dim3 blockSize(MAXTHREADSx, MAXTHREADSy, 1);
    const dim3 gridSize(nBlockX, nBlockY, 1);
    // cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
    int nstreams = 5;
    cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));
    for (int i = 0; i < nstreams; i++)
    {
        checkCudaErrors(cudaStreamCreateWithFlags(&(streams[i]), cudaStreamNonBlocking));
    }
    separateChannels<<<gridSize, blockSize>>>(d_inputImageRGBA,
                                              (int)numRows,
                                              (int)numCols,
                                              d_red,
                                              d_green,
                                              d_blue);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
    processOneChannel<<<gridSize, blockSize, 0, streams[0]>>>(d_red,
                                                              d_redProcessed,
                                                              (int)numRows, (int)numCols,
                                                              d_filter, d_prand);
    processOneChannel<<<gridSize, blockSize, 0, streams[1]>>>(d_green,
                                                              d_greenProcessed,
                                                              (int)numRows, (int)numCols,
                                                              d_filter, d_prand);
    processOneChannel<<<gridSize, blockSize, 0, streams[2]>>>(d_blue,
                                                              d_blueProcessed,
                                                              (int)numRows, (int)numCols,
                                                              d_filter, d_prand);
    cudaDeviceSynchronize();
    checkCudaErrors(cudaGetLastError());
    recombineChannels<<<gridSize, blockSize>>>(d_redProcessed,
                                               d_greenProcessed,
                                               d_blueProcessed,
                                               d_outputImageRGBA,
                                               numRows,
                                               numCols);
    for (int i = 0; i < nstreams; i++)
    {
        cudaStreamDestroy(streams[i]);
    }
    free(streams);
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}
Here's the nvprof GPU trace output. Note that the memcpy operations before the kernel launches pass filter data for the processing, so they cannot run concurrently with the kernel launches.
==10001== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
1.02428s 2.2400us - - - - - 28.125MB 1e+04GB/s GeForce GT 750M 1 13 [CUDA memset]
1.02855s 18.501ms - - - - - 28.125MB 1.4846GB/s GeForce GT 750M 1 13 [CUDA memcpy HtoD]
1.21959s 1.1371ms - - - - - 1.7580MB 1.5098GB/s GeForce GT 750M 1 13 [CUDA memcpy HtoD]
1.22083s 1.3440us - - - - - 7.0313MB 5e+03GB/s GeForce GT 750M 1 13 [CUDA memset]
1.22164s 1.3440us - - - - - 7.0313MB 5e+03GB/s GeForce GT 750M 1 13 [CUDA memset]
1.22243s 3.6480us - - - - - 7.0313MB 2e+03GB/s GeForce GT 750M 1 13 [CUDA memset]
1.22349s 10.240us - - - - - 8.0000KB 762.94MB/s GeForce GT 750M 1 13 [CUDA memcpy HtoD]
1.22351s 6.6021ms (6 1441 1) (512 1 1) 12 0B 0B - - GeForce GT 750M 1 13 separateChannels(...) [123]
1.23019s 10.661ms (6 1441 1) (512 1 1) 36 192B 0B - - GeForce GT 750M 1 14 processOneChannel(...) [133]
1.24085s 10.518ms (6 1441 1) (512 1 1) 36 192B 0B - - GeForce GT 750M 1 15 processOneChannel(...) [141]
1.25137s 10.779ms (6 1441 1) (512 1 1) 36 192B 0B - - GeForce GT 750M 1 16 processOneChannel(...) [149]
1.26372s 5.7810ms (6 1441 1) (512 1 1) 15 0B 0B - - GeForce GT 750M 1 13 recombineChannels(...) [159]
1.26970s 19.859ms - - - - - 28.125MB 1.3831GB/s GeForce GT 750M 1 13 [CUDA memcpy DtoH]
Here's the CMakeLists.txt where I passed -default-stream per-thread to nvcc:
cmake_minimum_required(VERSION 2.6 FATAL_ERROR)
find_package(OpenCV REQUIRED)
find_package(CUDA REQUIRED)
set(
    CUDA_NVCC_FLAGS
    ${CUDA_NVCC_FLAGS};
    -default-stream per-thread
)
file( GLOB hdr *.hpp *.h )
file( GLOB cu *.cu)
SET (My_files main.cpp)
# Project Executable
CUDA_ADD_EXECUTABLE(My ${My_files} ${hdr} ${cu})
target_link_libraries(My ${OpenCV_LIBS})

Each kernel launch consists of 6*1441 blocks, which is over 8000 blocks of 512 threads each. That is filling the machine, preventing blocks from subsequent kernel launches from executing.
The machine has a capacity. The maximum instantaneous capacity in blocks is equal to the number of SMs in your GPU multiplied by the maximum number of blocks per SM, both of which are specifications that you can retrieve with the deviceQuery app. When you fill it up, it cannot process more blocks until some of the already running blocks have retired. This process will continue for the first kernel launch until most of the blocks have retired. Then the second kernel will start executing.
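As a rough illustration (a minimal sketch, assuming a CUDA 11+ toolkit where the cudaDevAttrMaxBlocksPerMultiprocessor attribute is available), those two numbers can also be queried programmatically instead of reading them off deviceQuery:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // device 0

    int maxBlocksPerSM = 0;                          // attribute available in CUDA 11+
    cudaDeviceGetAttribute(&maxBlocksPerSM,
                           cudaDevAttrMaxBlocksPerMultiprocessor, 0);

    // Maximum number of thread blocks that can be resident on the GPU at one instant
    printf("SMs: %d, max blocks per SM: %d, instantaneous capacity: %d blocks\n",
           prop.multiProcessorCount, maxBlocksPerSM,
           prop.multiProcessorCount * maxBlocksPerSM);
    return 0;
}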

Related

How many cache lines are transmitted from memory to the L1 cache

Here is my computer's cache config:
LEVEL1_ICACHE_SIZE 32768
LEVEL1_ICACHE_ASSOC 8
LEVEL1_ICACHE_LINESIZE 64
LEVEL1_DCACHE_SIZE 32768
LEVEL1_DCACHE_ASSOC 8
LEVEL1_DCACHE_LINESIZE 64
LEVEL2_CACHE_SIZE 1048576
LEVEL2_CACHE_ASSOC 16
LEVEL2_CACHE_LINESIZE 64
LEVEL3_CACHE_SIZE 34603008
LEVEL3_CACHE_ASSOC 11
LEVEL3_CACHE_LINESIZE 64
LEVEL4_CACHE_SIZE 0
LEVEL4_CACHE_ASSOC 0
LEVEL4_CACHE_LINESIZE 0
Here is my code:
#include <stdlib.h>
#define MISS 39321600

int main() {
    int *p = (int*)malloc(MISS * sizeof(int));
    int i;
    for (i = 0; i < MISS; i++) {
        p[i] = 1;
    }
}
Here is the gcc version: gcc version 8.3.1 20190604
Compile command: g++ main.cpp
CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Then I profile it with: perf stat -e cache-misses ./a.out
The cache-miss count is around 400000,
while I expect around 39321600/(64/4) = 2457600.
Does 2457600/400000 ≈ 6 mean that about 6 cache lines are fetched from memory at once?
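One possible cross-check (a hedged suggestion; it assumes these generic perf events are exposed on this machine) is to count a few related events alongside cache-misses in the same run:
perf stat -e cache-misses,cache-references,LLC-load-misses,LLC-store-misses,page-faults ./a.out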

OpenMP vectorised code runs way slower than O3 optimized code

I have a minimal reproducible sample, which is as follows:
#include <iostream>
#include <chrono>
#include <immintrin.h>
#include <vector>
#include <numeric>

template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size) {
    for (size_t i = 0; i < size * size; i++) {
        result[i] = matA[i] + matB[i];
    }
}

int main() {
    size_t size = 8192;
    //std::cout<<sizeof(double) * 8<<std::endl;
    auto matA = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto matB = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    auto result = (float*) aligned_alloc(sizeof(float), size * size * sizeof(float));
    for (int i = 0; i < size * size; i++) {
        *(matA + i) = i;
        *(matB + i) = i;
    }
    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < 500; j++) {
        AddMatrixOpenMP<float>(matA, matB, result, size);
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "Average Time is = " << duration / 500 << std::endl;
    std::cout << *(result + 100) << " " << *(result + 1343) << std::endl;
}
I experiment as follows: I time the code with the #pragma omp for simd directive on the loop in the AddMatrixOpenMP function, and then time it without the directive. I compile the code as follows:
g++ -O3 -fopenmp example.cpp
Upon inspecting the assembly, both variants generate vector instructions, but when the OpenMP pragma is explicitly specified, the code runs 3 times slower.
I am not able to understand why.
Edit - I am running GCC 9.3 and OpenMP 4.5. This is running on an i7 9750h 6C/12T on Ubuntu 20.04. I ensured no major processes were running in the background. The CPU frequency held more or less constant during the run for both versions (Minor variations from 4.0 to 4.1)
TIA
The non-OpenMP vectorizer is defeating your benchmark with loop inversion.
Make your function __attribute__((noinline, noclone)) to stop GCC from inlining it into the repeat loop. For cases like this with large enough functions that call/ret overhead is minor, and constant propagation isn't important, this is a pretty good way to make sure that the compiler doesn't hoist work out of the loop.
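A minimal sketch of that change, using the GCC attribute syntax on the function from the question (the #pragma omp simd line is the directive the experiment toggles on and off):
template<typename type>
__attribute__((noinline, noclone))   // keep GCC from inlining this into the repeat loop
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size) {
    #pragma omp simd                 // the OpenMP SIMD directive under test
    for (size_t i = 0; i < size * size; i++) {
        result[i] = matA[i] + matB[i];
    }
}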
And in future, check the asm, and/or make sure the benchmark time scales linearly with the iteration count. e.g. increasing 500 up to 1000 should give the same average time in a benchmark that's working properly, but it won't with -O3. (Although it's surprisingly close here, so that smell test doesn't definitively detect the problem!)
After adding the missing #pragma omp simd to the code, yeah I can reproduce this. On i7-6700k Skylake (3.9GHz with DDR4-2666) with GCC 10.2 -O3 (without -march=native or -fopenmp), I get 18266, but with -O3 -fopenmp I get avg time 39772.
With the OpenMP vectorized version, if I look at top while it runs, memory usage (RSS) is steady at 771 MiB. (As expected: init code faults in the two inputs, and the first iteration of the timed region writes to result, triggering page-faults for it, too.)
But with the "normal" vectorizer (not OpenMP), I see the memory usage climb from ~500 MiB until it exits just as it reaches the max 770MiB.
So it looks like gcc -O3 performed some kind of loop inversion after inlining and defeated the memory-bandwidth-intensive aspect of your benchmark loop, only touching each array element once.
The asm shows the evidence: GCC 9.3 -O3 on Godbolt doesn't vectorize, and it leaves an empty inner loop instead of repeating the work.
.L4: # outer loop
movss xmm0, DWORD PTR [rbx+rdx*4]
addss xmm0, DWORD PTR [r13+0+rdx*4] # one scalar operation
mov eax, 500
.L3: # do {
sub eax, 1 # empty inner loop after inversion
jne .L3 # }while(--i);
add rdx, 1
movss DWORD PTR [rcx], xmm0
add rcx, 4
cmp rdx, 67108864
jne .L4
This is only 2 or 3x faster than fully doing the work. Probably because it's not vectorized, and it's effectively running a delay loop instead of optimizing away the empty inner loop entirely. And because modern desktops have very good single-threaded memory bandwidth.
Bumping up the repeat count from 500 to 1000 only improved the computed "average" from 18266 to 17821 us per iter. An empty loop still takes 1 iteration per clock. Normally scaling linearly with the repeat count is a good litmus test for broken benchmarks, but this is close enough to be believable.
There's also the overhead of page faults inside the timed region, but the whole thing runs for multiple seconds so that's minor.
The OpenMP vectorized version does respect your benchmark repeat-loop. (Or to put it another way, doesn't manage to find the huge optimization that's possible in this code.)
Looking at memory bandwidth while the benchmark is running:
Running intel_gpu_top -l while the proper benchmark (OpenMP, or with __attribute__((noinline, noclone))) is running shows the following. IMC is the Integrated Memory Controller on the CPU die, shared by the IA cores and the GPU via the ring bus. That's why a GPU-monitoring program is useful here.
$ intel_gpu_top -l
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
0 0 0 97 0.00 20421 7482 0.00 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 14 99 0.02 19627 6505 0.47 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 20 98 0.02 19625 6516 0.67 0 0 0.00 0 0 0.00 0 0 0.00 0 0
11 10 22 98 0.03 19632 6516 0.65 0 0 0.00 0 0 0.00 0 0 0.00 0 0
3 4 13 99 0.02 19609 6505 0.46 0 0 0.00 0 0 0.00 0 0 0.00 0 0
Note the ~19.6GB/s read / 6.5GB/s write. Read ~= 3x write since it's not using NT stores for the output stream.
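For reference, a hedged sketch of what using NT stores for the output stream could look like (purely illustrative, not part of the original benchmark; it assumes the output pointer is 32-byte aligned, e.g. from aligned_alloc(32, ...), that the element count is a multiple of 8, and that the code is compiled with AVX enabled, e.g. -mavx or -march=native):
#include <immintrin.h>
#include <cstddef>

// Illustrative AVX variant that writes the output with non-temporal stores,
// so the result stream bypasses the cache hierarchy.
void AddMatrixNT(const float* matA, const float* matB, float* result, size_t size) {
    for (size_t i = 0; i < size * size; i += 8) {           // assumes size*size % 8 == 0
        __m256 a = _mm256_loadu_ps(matA + i);
        __m256 b = _mm256_loadu_ps(matB + i);
        _mm256_stream_ps(result + i, _mm256_add_ps(a, b));  // requires 32-byte aligned result
    }
    _mm_sfence();  // order the NT stores before any later loads of result
}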
But with -O3 defeating the benchmark, with a 1000 repeat count, we see only near-idle levels of main-memory bandwidth.
Freq MHz IRQ RC6 Power IMC MiB/s RCS/0 BCS/0 VCS/0 VECS/0
req act /s % W rd wr % se wa % se wa % se wa % se wa
...
8 8 17 99 0.03 365 85 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
9 9 17 99 0.02 349 90 0.62 0 0 0.00 0 0 0.00 0 0 0.00 0 0
4 4 5 100 0.01 303 63 0.25 0 0 0.00 0 0 0.00 0 0 0.00 0 0
7 7 15 100 0.02 345 69 0.43 0 0 0.00 0 0 0.00 0 0 0.00 0 0
10 10 21 99 0.03 350 74 0.64 0 0 0.00 0 0 0.00 0 0 0.00 0 0
vs. a baseline of 150 to 180 MB/s read, 35 to 50MB/s write when the benchmark isn't running at all. (I have some programs running that don't totally sleep even when I'm not touching the mouse / keyboard.)

How to differentiate GPU threads on a single GPU for different host CPU threads

When multiple CPU threads dispatch jobs to a single GPU, what's the best way to differentiate GPU threads so that the CPU threads do not simply repeat each other's work?
The following code calculates the sum of two large arrays element by element. The correct result is 3.0. When using 1 CPU thread, the code does the right thing. When running with 8 CPU threads, the output becomes 10 because the kernel duplicates the calculation 8 times. I'm looking for a way for each CPU thread to calculate 1/8 of the sum without duplicating the others.
#include <iostream>
#include <math.h>
#include <thread>
#include <vector>
#include <cuda.h>
using namespace std;

const unsigned NUM_THREADS = std::thread::hardware_concurrency();

// Kernel function to add the elements of two arrays
__global__
void add_2(int n, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) {
        y[i] = x[i] + y[i];
    }
}
//
void thread_func(int N, float *x, float *y, int idx_thread)
{
    cudaSetDevice(0);
    int blockSize;
    int minGridSize;
    int gridSize;
    cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, add_2, 0, N);
    // Round up according to array size
    gridSize = (N + blockSize - 1) / blockSize;
    //gridSize /= NUM_THREADS +1;
    cout<<"blockSize: "<<blockSize<<" minGridSize: "<<minGridSize<<" gridSize: "<<gridSize<<endl;
    // Run kernel on 1M elements on the GPU
    add_2<<<gridSize, blockSize>>>(N, x, y);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
}
//
int main()
{
    int N = 1<<20;
    float *x, *y;
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    //.. begin multithreading ..
    vector<std::thread> t;
    for(int i = 0; i<NUM_THREADS; i++)
        t.push_back(thread(thread_func, N, x, y, i));
    for(int i = 0; i<NUM_THREADS; i++)
        t[i].join();
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        if(!(i%10000))
            std::cout<<i<<" "<<y[i]<<std::endl;
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    }
    std::cout << "Max error: " << maxError << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    return 0;
}
Output:
blockSize: 1024 minGridSize: 16 gridSize: 1024
..........
blockSize: 1024 minGridSize: 16 gridSize: 1024
0 10
10000 10
20000 10
...
1020000 10
1030000 10
1040000 10
Max error: 7
The solution for this very simple case is to divide up your array into pieces, one piece per thread. For simplicity, so that I don't have to handle a bunch of annoying corner cases, let's assume that your array size (N) is whole-number divisible by NUM_THREADS. It doesn't have to be this way, of course; the arithmetic to divide it up isn't much different, but you would have to handle rounding at each segment boundary, which I'd rather avoid.
Here's an example that works based on the above assumption. Each thread decides which portion of the array it is responsible for (based on its thread number and the total length) and only works on that section.
$ cat t1460.cu
#include <iostream>
#include <math.h>
#include <thread>
#include <vector>
#include <cuda.h>
using namespace std;

const unsigned NUM_THREADS = 8;

// Kernel function to add the elements of two arrays
__global__
void add_2(int n, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n) {
        y[i] = x[i] + y[i];
    }
}
//
void thread_func(int N, float *x, float *y, int idx_thread)
{
    cudaSetDevice(0);
    int blockSize = 512;
    int worksize = N/NUM_THREADS; // assumes whole-number divisibility
    int gridSize = (worksize+blockSize-1)/blockSize;
    cout<<"blockSize: "<<blockSize<<" gridSize: "<<gridSize<<endl;
    // Run kernel on 1M elements on the GPU
    add_2<<<gridSize, blockSize>>>(worksize, x+(idx_thread*worksize), y+(idx_thread*worksize));
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
}
//
int main()
{
    int N = 1<<20;
    float *x, *y;
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));
    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    //.. begin multithreading ..
    vector<std::thread> t;
    for(int i = 0; i<NUM_THREADS; i++)
        t.push_back(thread(thread_func, N, x, y, i));
    for(int i = 0; i<NUM_THREADS; i++)
        t[i].join();
    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++) {
        if(!(i%10000))
            std::cout<<i<<" "<<y[i]<<std::endl;
        maxError = fmaxf(maxError, fabs(y[i]-3.0f));
    }
    std::cout << "Max error: " << maxError << std::endl;
    // Free memory
    cudaFree(x);
    cudaFree(y);
    return 0;
}
$ nvcc t1460.cu -o t1460 -std=c++11
$ cuda-memcheck ./t1460
========= CUDA-MEMCHECK
blockSize: blockSize: 512 gridSize: 256512blockSize: gridSize:
blockSize: blockSize: 512blockSize: gridSize: 256512
gridSize: 256
blockSize: 512 gridSize: 256
blockSize: 512 gridSize: 256
512 gridSize: 256
256
512 gridSize: 256
0 3
10000 3
20000 3
30000 3
40000 3
50000 3
60000 3
70000 3
80000 3
90000 3
100000 3
110000 3
120000 3
130000 3
140000 3
150000 3
160000 3
170000 3
180000 3
190000 3
200000 3
210000 3
220000 3
230000 3
240000 3
250000 3
260000 3
270000 3
280000 3
290000 3
300000 3
310000 3
320000 3
330000 3
340000 3
350000 3
360000 3
370000 3
380000 3
390000 3
400000 3
410000 3
420000 3
430000 3
440000 3
450000 3
460000 3
470000 3
480000 3
490000 3
500000 3
510000 3
520000 3
530000 3
540000 3
550000 3
560000 3
570000 3
580000 3
590000 3
600000 3
610000 3
620000 3
630000 3
640000 3
650000 3
660000 3
670000 3
680000 3
690000 3
700000 3
710000 3
720000 3
730000 3
740000 3
750000 3
760000 3
770000 3
780000 3
790000 3
800000 3
810000 3
820000 3
830000 3
840000 3
850000 3
860000 3
870000 3
880000 3
890000 3
900000 3
910000 3
920000 3
930000 3
940000 3
950000 3
960000 3
970000 3
980000 3
990000 3
1000000 3
1010000 3
1020000 3
1030000 3
1040000 3
Max error: 0
========= ERROR SUMMARY: 0 errors
$
Of course, for this trivial example, there's no particular benefit to using 8 CPU threads. I assume that what was being asked here was for a design pattern to enable other activity. Multiple CPU threads might be a convenient way to arrange other work. For example, I might have a system that is processing data from 4 cameras. It might be convenient to organize my camera processing as 4 independent threads, one for each camera. That system might only have 1 GPU, and it's certainly plausible that each of the 4 threads might want to issue independent work to that GPU. This design pattern could easily be adapted to that use case, to pick one example. It might even be that the 4 camera CPU threads would need to combine some data into a single array on the GPU, and this pattern could be used in that case.
When multiple CPU threads dispatch jobs to a single GPU, what's the best way to differentiate GPU threads so that the CPU threads do not simply repeat each other's work?
Let me answer that more generally than regarding your specific example:
There is no inherent benefit in using multiple threads to enqueue work on a GPU. If you have each thread wait on a CUDA queue, then it could make sense, but that's not necessarily the right thing to do.
Unless you're explicitly scheduling memory transfers, there's no guaranteed inherent benefit to splitting up the work you schedule into small pieces. You could just schedule a single kernel to add up the entire array. Remember - a kernel is made up of thousands or millions of 'threads' on the GPU side; CPU threads don't help GPU parallelism at all.
It makes more sense to have different threads schedule work when they come to realize it exists independently of each other.
It is often a good idea to write a kernel's output someplace different than its input. It requires more memory during the computation, but it prevents the kind of problems you describe - of overlapping changes of the same value, of having to carefully consider which scheduled kernel executes first etc. Thus, for example, you could have implemented:
__global__ void add_2(int n, float* result, const float *x, const float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        result[i] = x[i] + y[i];
    }
}
If you can't do that, then you need careful partitioning of the input-output arrays to schedule the work, as suggested in @RobertCrovella's answer.
Use the __restrict__ keyword (even though it's not standard C++) to indicate that the areas the parameters point to don't overlap. That speeds things up. See:
CUDA: __restrict__ tag usage
The __restrict__ section of the CUDA C programming guide.
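A minimal sketch of the same kernel with __restrict__ added to the pointer parameters (an illustrative variant, not code taken from the question):
__global__ void add_2(int n, float* __restrict__ result,
                      const float* __restrict__ x, const float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        result[i] = x[i] + y[i];   // compiler may assume x, y, result do not alias
    }
}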

Huge latency spikes while running simple code

I have a simple benchmark that demonstrates the performance of busy-wait threads. It runs in two modes: the first one simply gets two timepoints sequentially; the second one iterates through a vector and measures the duration of an iteration.
I see that two sequential calls of clock::now() take about 50 nanoseconds on average, and one average iteration through the vector takes about 100 nanoseconds. But sometimes these operations are executed with a huge delay: about 50 microseconds in the first case and 10 milliseconds (!) in the second case.
The test runs on a single isolated core, so context switches do not occur. I also call mlockall at the beginning of the program, so I assume that page faults do not affect the performance.
The following additional optimizations were also applied:
kernel boot parameters: intel_idle.max_cstate=0 idle=halt
irqaffinity=0,14 isolcpus=4-13,16-27 pti=off spectre_v2=off audit=0
selinux=0 nmi_watchdog=0 nosoftlockup=0 rcu_nocb_poll rcu_nocbs=19-20
nohz_full=19-20;
rcu[^c] kernel threads moved to a housekeeping CPU core 0;
network card RxTx queues moved to a housekeeping CPU core 0;
writeback kernel workqueue moved to a housekeeping CPU core 0;
transparent_hugepage disabled;
Intel CPU HyperThreading disabled;
swap file/partition is not used.
Environment:
System details:
Default Archlinux kernel:
5.1.9-arch1-1-ARCH #1 SMP PREEMPT Tue Jun 11 16:18:09 UTC 2019 x86_64 GNU/Linux
which has the following PREEMPT and HZ settings:
CONFIG_HZ_300=y
CONFIG_HZ=300
CONFIG_PREEMPT=y
Hardware details:
RAM: 256GB
CPU(s): 28
On-line CPU(s) list: 0-27
Thread(s) per core: 1
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 3200.011
CPU max MHz: 3500.0000
CPU min MHz: 1200.0000
BogoMIPS: 5202.68
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
NUMA node0 CPU(s): 0-13
NUMA node1 CPU(s): 14-27
Example code:
struct TData
{
    std::vector<char> Data;

    TData() = default;
    TData(size_t aSize)
    {
        for (size_t i = 0; i < aSize; ++i)
        {
            Data.push_back(i);
        }
    }
};

using TBuffer = std::vector<TData>;

TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
{
    if (!aPerform)
    {
        return TData {};
    }
    const TData& result = aBuffer[outBufferIndex];
    if (++outBufferIndex == aBuffer.size())
    {
        outBufferIndex = 0;
    }
    return result;
}

void WarmUp(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer)
{
    size_t bufferIndex = 0;
    for (size_t i = 0; i < aCyclesCount; ++i)
    {
        auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
    }
}

void TestCycle(size_t aCyclesCount, bool aPerform, const TBuffer& aBuffer, Measurings& outStatistics)
{
    size_t bufferIndex = 0;
    for (size_t i = 0; i < aCyclesCount; ++i)
    {
        auto t1 = std::chrono::steady_clock::now();
        {
            auto data = DoMemoryOperation(aPerform, aBuffer, bufferIndex);
        }
        auto t2 = std::chrono::steady_clock::now();
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
        outStatistics.AddMeasuring(diff, t2);
    }
}

int Run(int aCpu, size_t aDataSize, size_t aBufferSize, size_t aCyclesCount, bool aAllocate, bool aPerform)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE))
    {
        throw std::runtime_error("mlockall failed");
    }
    std::cout << "Test parameters"
        << ":\ndata size=" << aDataSize
        << ",\nnumber of elements=" << aBufferSize
        << ",\nbuffer size=" << aBufferSize * aDataSize
        << ",\nnumber of cycles=" << aCyclesCount
        << ",\nallocate=" << aAllocate
        << ",\nperform=" << aPerform
        << ",\nthread ";
    SetCpuAffinity(aCpu);
    TBuffer buffer;
    if (aPerform)
    {
        buffer.resize(aBufferSize);
        std::fill(buffer.begin(), buffer.end(), TData { aDataSize });
    }
    WaitForKey();
    std::cout << "Running..." << std::endl;
    WarmUp(aBufferSize * 2, aPerform, buffer);
    Measurings statistics;
    TestCycle(aCyclesCount, aPerform, buffer, statistics);
    statistics.Print(aCyclesCount);
    WaitForKey();
    if (munlockall())
    {
        throw std::runtime_error("munlockall failed");
    }
    return 0;
}
And the following results are received:
First:
StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=0
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=0,
thread 14056 on cpu 19
Statistics: min: 16: max: 18985: avg: 18
0 - 10 : 0 (0 %): -
10 - 100 : 999993494 (99 %): min: 40: max: 117130: avg: 40
100 - 1000 : 946 (0 %): min: 380: max: 506236837: avg: 43056598
1000 - 10000 : 5549 (0 %): min: 56876: max: 70001739: avg: 7341862
10000 - 18985 : 11 (0 %): min: 1973150818: max: 14060001546: avg: 3644216650
Second:
StandaloneTests --run_test=MemoryAccessDelay --cpu=19 --data-size=280 --size=67108864 --count=1000000000 --allocate=1 --perform=1
Test parameters:
data size=280,
number of elements=67108864,
buffer size=18790481920,
number of cycles=1000000000,
allocate=1,
perform=1,
thread 3264 on cpu 19
Statistics: min: 36: max: 4967479: avg: 48
0 - 10 : 0 (0 %): -
10 - 100 : 964323921 (96 %): min: 60: max: 4968567: avg: 74
100 - 1000 : 35661548 (3 %): min: 122: max: 4972632: avg: 2023
1000 - 10000 : 14320 (0 %): min: 1721: max: 33335158: avg: 5039338
10000 - 100000 : 130 (0 %): min: 10010533: max: 1793333832: avg: 541179510
100000 - 1000000 : 0 (0 %): -
1000000 - 4967479 : 81 (0 %): min: 508197829: max: 2456672083: avg: 878824867
Any ideas what the reason for such huge delays is, and how it may be investigated?
In:
TData DoMemoryOperation(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex);
It returns a TData, which holds a std::vector<char>, by value. That involves a memory allocation and data copying. A memory allocation can do a syscall (brk or mmap), and memory-mapping-related syscalls are notorious for being slow.
When timings include syscalls one cannot expect low variance.
You may like to run your application with /usr/bin/time --verbose <app> or perf stat -ddd <app> to see the number of page faults and context switches.
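A minimal sketch of one way to take that allocation out of the timed path (an illustrative variant of the question's function, returning a pointer into the buffer instead of a copy):
// Returns a pointer to the existing element instead of copying it into a new
// TData, so the timed loop performs no heap allocation or data copy.
const TData* DoMemoryOperationNoCopy(bool aPerform, const TBuffer& aBuffer, size_t& outBufferIndex)
{
    if (!aPerform)
    {
        return nullptr;
    }
    const TData* result = &aBuffer[outBufferIndex];
    if (++outBufferIndex == aBuffer.size())
    {
        outBufferIndex = 0;
    }
    return result;
}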

OMP: How to find the right size of cache at runtime

I have the following piece of (pseudo) code:
static void ConvertBuffer( unsigned char * buffer, const int width )
{
#pragma omp parallel for
    for ( int x = 0; x < width; ++x ) // one image row
    {
        RGB rgb = {0,0,0}; HSB hsb;
        rgb.red = (float)buffer[x] / 255.;
        RGBToHSB(rgb, hsb);
        buffer[x] = hsb.brightness * 255;
    }
}
This is a very naive implementation of an RGB → HSB conversion algorithm.
The first implementation would pull a single scanline (= one row of the image) at a time, in my case 65536 bytes. However, after trial and error on my particular system, I discovered that I could decrease the total computation time by a factor of 2 if I instead processed 16 scanlines at a time (= 1048576 bytes).
What tools are available to help me find that magic number, possibly at runtime, so that I do not need to hard-code a magical value of 16 somewhere in my code?
If I know that RGBToHSB is embarrassingly parallel and cache-friendly, can I just completely fill the L3 cache, and should that be close to the maximum possible speed?
For reference, my system is described by:
$ sudo likwid-topology
-------------------------------------------------------------
CPU type: Intel Core SandyBridge processor
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets: 1
Cores per socket: 4
Threads per core: 1
-------------------------------------------------------------
HWThread Thread Core Socket
0 0 0 0
1 0 1 0
2 0 2 0
3 0 3 0
-------------------------------------------------------------
Socket 0: ( 0 1 2 3 )
-------------------------------------------------------------
*************************************************************
Cache Topology
*************************************************************
Level: 1
Size: 32 kB
Cache groups: ( 0 ) ( 1 ) ( 2 ) ( 3 )
-------------------------------------------------------------
Level: 2
Size: 256 kB
Cache groups: ( 0 ) ( 1 ) ( 2 ) ( 3 )
-------------------------------------------------------------
Level: 3
Size: 6 MB
Cache groups: ( 0 1 2 3 )
-------------------------------------------------------------
*************************************************************
NUMA Topology
*************************************************************
NUMA domains: 1
-------------------------------------------------------------
Domain 0:
Processors: 0 1 2 3
Relative distance to nodes: 10
Memory: 122.332 MB free of total 5898.17 MB
-------------------------------------------------------------
You can't really define a 'right size' for buffering. My answer would be to set it as big as reasonably possible. I would say somewhere between 10MB and 100MB, but you can set it higher if you can afford it, or lower if you are short on RAM.
If you are reading a file and writing to a file (same or another), you should consider using memory mapped files. This way you get rid of the buffering (managed by the OS), and you can call your function once for the whole image. Note that this is probably not a good idea on a 32-bit system if your image is bigger than 4GB.
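A minimal POSIX sketch of the memory-mapped-file idea (the file name, raw layout, and helper name are hypothetical; error handling is kept short):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an existing raw image file read/write so ConvertBuffer can process it
// in place, letting the OS handle paging instead of manual buffering.
static unsigned char * MapImageFile( const char * path, size_t * outSize )
{
    int fd = open( path, O_RDWR );
    if ( fd < 0 ) return NULL;
    struct stat st;
    if ( fstat( fd, &st ) != 0 ) { close( fd ); return NULL; }
    void * p = mmap( NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
    close( fd );                         // the mapping remains valid after close
    if ( p == MAP_FAILED ) return NULL;
    *outSize = (size_t)st.st_size;
    return (unsigned char *)p;           // munmap( p, *outSize ) when finished
}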