Long cuMemToHostAlloc call after exiting a kernel with copyout - fortran

I am accelerating a Fortran code with OpenACC. When I profiled the program with NVIDIA Nsight, I noticed that the first call of a kernel with a copyout clause is followed by a long call to cuMemToHostAlloc.
Here is a trivial example illustrating this. The program launches a kernel 10 times in succession; each launch computes an array test and copies its values back to the host:
program test
  implicit none
  real, allocatable :: test(:)
  integer :: i, j, n, m
  n = 1000
  m = 10
  allocate(test(n))   ! the host array must exist before the copyout
  do j = 1, m
    !$acc kernels copyout(test)
    !$acc loop independent
    do i = 1, n
      test(i) = real(i)
    end do
    !$acc end kernels
  end do
  deallocate(test)
end program test
The code is compiled with NVHPC 22.7, using no optimization flag (adding such flags did not have any influence).
The profiling of the code gives:
Compared to the actual memory transfer time, as seen for the 9 other calls, the call to cuMemToHostAlloc is ridiculously long.
If I remove the copyout clause, the call to cuMemToHostAlloc disappears, so this is related to copying back data from the device, but I do not understand why it only happens once and for so long.
Also, the test array is already allocated on the host memory.
Am I missing something?

That call creates the pinned memory buffers used to transfer data between the host and the device. DMA transfers must use non-swappable, i.e. pinned, memory.
We use a double-buffering system: while one buffer is being filled from virtual memory, the second buffer is transferred asynchronously to the device, which effectively hides much of the virtual-to-pinned memory copy.
The host pinned memory allocation is relatively expensive, but it occurs only once, when the runtime first encounters a data region, so the cost is amortized over the rest of the run.
Note that by removing the copyout, you remove the need to transfer the data, and hence the need for the buffers.
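To get a feel for how much of the staging copy double buffering can hide, here is a rough timing model in C++. It is only a sketch: the function names are made up for illustration, and it assumes every chunk takes the same fixed time to stage (fill_us) and to transfer (xfer_us), which a real runtime will not guarantee.

```cpp
#include <algorithm>
#include <cstdint>

// Time to move `chunks` chunks when each chunk is first staged into a pinned
// buffer (fill_us) and then transferred (xfer_us), one step after the other.
std::int64_t serial_time_us(std::int64_t chunks, std::int64_t fill_us, std::int64_t xfer_us) {
    return chunks * (fill_us + xfer_us);
}

// Same work with two staging buffers: while one chunk is in flight, the next
// one is being staged, so each steady-state step costs only the slower stage.
std::int64_t double_buffered_time_us(std::int64_t chunks, std::int64_t fill_us, std::int64_t xfer_us) {
    if (chunks <= 0) return 0;
    return fill_us + (chunks - 1) * std::max(fill_us, xfer_us) + xfer_us;
}
```

With 4 chunks, 3 us to stage and 5 us to transfer each, the serial scheme takes 32 us in this model and the double-buffered one 23 us.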


How to avoid constant memory copying in OpenCL

I wrote a C++ application that simulates simple heat flow. It uses OpenCL for the computation.
The OpenCL kernel takes a two-dimensional (n x n) array of temperature values and its size (n). It returns a new array with the temperatures after each cycle:
int t_id = get_global_id(0);
if (t_id < n * n)
    m_new[t_id / n][t_id % n] = average of its own and its neighbors' (top, bottom, left, right) temperatures
As you can see, every thread computes a single cell of the matrix. When the host application needs to perform X computing cycles, it looks like this:
For 1 ... X
Copy memory to OpenCL device
Call kernel
Copy memory back
I would like to rewrite kernel code to perform all X cycles without constant memory copying to/from OpenCL device.
Copy memory to OpenCL device
Call kernel X times OR call kernel one time and make it compute X cycles.
Copy memory back
I know that each thread in the kernel should wait until all other threads have finished their work, and after that m[][] and m_new[][] should be swapped. I have no idea how to implement either of those two pieces of functionality.
Or maybe there is another way to do this optimally?
Copy memory to OpenCL device
Call kernel X times
Copy memory back
This works. Make sure the kernel call is non-blocking (this saves 1-2 ms per cycle) and that the buffers are not created with host-accessible flags such as CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR.
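The "copy once, call the kernel X times, copy back once" scheme relies on the two arrays swapping roles after every cycle. Below is a minimal host-only C++ sketch of that ping-pong pattern. It is not OpenCL code: heat_step stands in for one kernel launch, the names are hypothetical, and the boundary rule (edge cells average only over the neighbors that exist) is an assumption, since the question does not say how edges are handled.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One cycle: every cell becomes the average of itself and its existing
// top/bottom/left/right neighbors (assumed boundary rule).
void heat_step(const std::vector<float>& m, std::vector<float>& m_new, std::size_t n) {
    for (std::size_t t_id = 0; t_id < n * n; ++t_id) {
        std::size_t r = t_id / n, c = t_id % n;
        float sum = m[t_id];
        int count = 1;
        if (r > 0)     { sum += m[t_id - n]; ++count; }
        if (r + 1 < n) { sum += m[t_id + n]; ++count; }
        if (c > 0)     { sum += m[t_id - 1]; ++count; }
        if (c + 1 < n) { sum += m[t_id + 1]; ++count; }
        m_new[t_id] = sum / count;
    }
}

// "Copy in once, run X cycles, copy out once": the two buffers swap roles
// after every cycle, so no data is copied between cycles.
std::vector<float> run_cycles(std::vector<float> m, std::size_t n, int cycles) {
    std::vector<float> m_new(m.size());
    for (int k = 0; k < cycles; ++k) {
        heat_step(m, m_new, n);   // stands in for one kernel launch
        std::swap(m, m_new);      // O(1) swap of the buffers, not of their contents
    }
    return m;                     // after the last swap, m holds the latest result
}
```

In an OpenCL host program the analogous step would presumably be to swap the two buffer arguments with clSetKernelArg between launches, leaving the data on the device the whole time.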
If calling the kernel X times doesn't give satisfactory performance, you can try using a single workgroup (e.g. only 256 threads) that loops X times, with a barrier() at the end of each cycle so that all 256 threads synchronize before starting the next one. This way you can compute M different heat-flow problems at the same time, where M is the number of compute units (or workgroups); on a server, it can serve that many computations concurrently.
Global synchronization is not possible because by the time the last threads are launched, the first threads are already gone. Only (number of compute units) * (number of threads per workgroup) * (number of wavefronts per workgroup) threads run concurrently. For example, an R7-240 GPU with 5 compute units and local-range=256 can run maybe 5*256*20 = 25k threads at a time.
Then, for further performance, you can apply local-memory optimizations.

OpenCL variables or array in kernel cost memory?

I am trying to run the following OpenCL code. In the kernel function, I define an array int arr[1000] = {0};
kernel void test()
{
    int arr[1000] = {0};
}
Then I create N threads to run the kernel.
cl::CommandQueue cmdQueue;
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange); // kernel here is the one running test()
My question is: since OpenCL runs the threads in parallel, does that mean the peak memory usage will be N * 1000 * sizeof(int)?
This is not the way to OpenCL (yes, that's what I meant :).
The kernel function operates on kernel operands passed in from the host (CPU), so you'd allocate your array on the host using clCreateBuffer and set the argument using clSetKernelArg. Your kernel does not declare or allocate the device memory; it simply receives it as a __global argument. When you then run the kernel using clEnqueueNDRangeKernel with a global size of 1000, the buffer holds 1000 ints in total and one work-item runs on each of those ints.
If, on the other hand, you meant to allocate 1000 ints per work-item (device thread), your calculation is right (yes, each work-item's copy costs memory), but it probably won't work. An array declared inside the kernel lives in the work-item's private memory, not in a shared pool, and that memory is severely limited (see here on how to check this for your device).
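The N * 1000 * sizeof(int) estimate from the question can be written out as a small C++ helper. This is just the arithmetic, under the assumption that all N work-items are resident at once and that an OpenCL int is 32 bits; the function name is hypothetical, and a real device keeps only a subset of work-items in flight.

```cpp
#include <cstdint>

// Upper bound on the private memory needed if `work_items` work-items were
// resident at once and each held `ints_per_item` 32-bit ints.
std::uint64_t private_array_bytes(std::uint64_t work_items, std::uint64_t ints_per_item) {
    return work_items * ints_per_item * sizeof(std::int32_t);
}
```

For N = 1,048,576 work-items this gives 4,194,304,000 bytes (about 3.9 GiB), which shows why a 1000-element per-work-item array does not scale.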

CUDA boolean variable for host loop

A very simplified version of my code looks like:
do {
    // reset loop variable b to 0/false
    b = 0;
    // execute kernel
    // use the value of b for the while condition
} while (b);
The Boolean variable b can be set to true by any thread in the kernel, and it tells us whether to continue running the loop.
Using cudaMalloc, cudaMemset, and cudaMemcpy we can create/set/copy device memory to implement this. However, I just discovered pinned memory. Using cudaMallocHost to allocate b, with a call to cudaDeviceSynchronize right after the kernel, gave quite a speed-up (~50%) in a simple test program.
Is pinned memory the best option for this boolean variable b or is there a better option?
You haven't shown your initial code or the modified code, so nobody can know the details of the improvement you describe in your post.
The answer to your question depends on:
Whether b is both read and written, or only written, inside the GPU kernel. Reads might need to fetch the actual value directly from the host side if b is not found in the cache, which adds latency. On the other hand, the latency of writes can be hidden if there are further operations that keep the threads busy.
How frequently you access the value. If you access it frequently in your program, the GPU can probably keep the variable in L2, avoiding host-side accesses.
The number of memory operations between accesses to b. If there are many memory transactions between accesses to b, it is more likely that b is evicted from the cache and replaced with other content. As a result, when it is accessed again, b will not be found in the cache and a time-consuming host access is necessary.
In cases where keeping b on the host side causes many host memory transactions, it is logical to keep it in GPU global memory and transfer it back at the end of each loop iteration. You can do this rather quickly with an asynchronous copy in the same stream as the kernel, followed by a synchronization with the host.
All of the above applies to cache-enabled devices. If your device is pre-Fermi (CC < 2.0), the story is different.

Parallelization of elementwise matrix multiplication

I'm currently optimizing parts of my code and am therefore doing some benchmarking.
I have NxN matrices A and T and want to multiply them elementwise and save the result in A again, i.e. A = A*T. As this code is not parallelizable as is, I expanded the assignment into
!$omp parallel do
do j = 1, N
  do i = 1, N
    A(i,j) = T(i,j) * A(i,j)
  end do
end do
!$omp end parallel do
(Full minimal working example at http://pastebin.com/RGpwp2KZ.)
The strange thing is that, regardless of the number of threads (between 1 and 4), the execution time stays more or less the same (+- 10%), while the CPU time increases with the number of threads. That made me think that all the threads do the same work (because I made a mistake with OpenMP) and therefore need the same time.
But on another computer (that has 96 CPU cores available) the program behaves as expected: With increasing thread number the execution time decreases. Surprisingly the CPU time decreases as well (up to ~10 threads, then rising again).
It might be that there are different versions of OpenMP or gfortran installed. If this could be the cause it'd be great if you could tell me how to find that out.
You could in theory make Fortran array operations parallel by using the Fortran-specific OpenMP WORKSHARE directive:
!$omp parallel workshare
A(:,:) = T(:,:) * A(:,:)
!$omp end parallel workshare
Note that though this is quite standard OpenMP code, some compilers, and most notably the Intel Fortran Compiler (ifort), implement the WORKSHARE construct simply by the means of the SINGLE construct, giving therefore no parallel speed-up whatsoever. On the other hand, gfortran converts this code fragment into an implicit PARALLEL DO loop. Note that gfortran won't parallelise the standard array notation A = T * A inside the worksharing construct unless it is written explicitly as A(:,:) = T(:,:) * A(:,:).
Now about the performance and the lack of speed-up. Each column of your A and T matrices occupies (2 * 8) * 729 = 11664 bytes. One matrix occupies 8.1 MiB and the two matrices together occupy 16.2 MiB. This probably exceeds the last-level cache size of your CPU.
Also, the multiplication code has very low compute intensity: it fetches 32 bytes of memory data per iteration, performs one complex multiplication in 6 FLOPs (4 real multiplications, 1 addition and 1 subtraction), then stores 16 bytes back to memory, which results in (6 FLOP)/(48 bytes) = 1/8 FLOP/byte. If the memory is considered to be full duplex, i.e. it supports writing while being read, then the intensity goes up to (6 FLOP)/(32 bytes) = 3/16 FLOP/byte. It follows that the problem is memory bound and even a single CPU core might be able to saturate all the available memory bandwidth.
For example, a typical x86 core can retire two floating-point operations per cycle, and if run at 2 GHz it could deliver 4 GFLOP/s of scalar math. To keep such a core busy running your multiplication loop, the main memory should provide (4 GFLOP/s) * (16/3 byte/FLOP) = 21.3 GB/s. This quantity more or less exceeds the real memory bandwidth of current-generation x86 CPUs. And this is only for a single core with non-vectorised code. Adding more cores and threads would not increase the performance, since the memory simply cannot deliver data fast enough to keep the cores busy. Rather, the performance will suffer, since having more threads adds more overhead.
When run on a multisocket system like the one with 96 cores, the program gets access to more last-level cache and to higher main memory bandwidth (assuming a NUMA system with a separate memory controller in each CPU socket), thus the performance increases, but only because there are more sockets and not because there are more cores.
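The numbers in the estimate above can be reproduced with a few lines of C++. This is only the arithmetic from the answer written out as code: the helper names are hypothetical, and the 16-byte element size assumes double-precision complex matrices, as the (2 * 8) factor implies.

```cpp
#include <cstdint>

// Bytes occupied by an n x n matrix of double-precision complex numbers (2 * 8 bytes each).
std::uint64_t matrix_bytes(std::uint64_t n) {
    return n * n * 16;
}

// Arithmetic intensity in FLOP/byte.
double intensity(double flops, double bytes) {
    return flops / bytes;
}

// Memory bandwidth (bytes/s) needed to sustain `flop_rate` FLOP/s at the given intensity.
double required_bandwidth(double flop_rate, double flops_per_byte) {
    return flop_rate / flops_per_byte;
}
```

For example, matrix_bytes(729) is 8,503,056 bytes (about 8.1 MiB), intensity(6, 48) is 0.125 FLOP/byte, and required_bandwidth(4e9, intensity(6, 32)) comes out at roughly 21.3e9 bytes/s.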

is using cudaHostAlloc good for my case

I have a kernel that is launched several times until a solution is found. The solution will be found by at least one block.
Therefore, when a block finds the solution, it should inform the CPU that the solution has been found, so that the CPU can print the solution provided by this block.
What I am currently doing is the following:
__global__ void kernel(int *sol)
{
    // do some computations
    bool found = false;   // placeholder: set to true when this block finds a solution
    if (found)
        atomicExch(sol, blockIdx.x);   // record the index of the block atomically
}
Now, on every call to the kernel, I copy sol back to host memory and check its value. If it is set to 3, for example, I know that block 3 found the solution, so I know where the index of the solution starts and can copy the solution back to the host.
In this case, would using cudaHostAlloc be a better option? Moreover, would copying the value of a single integer on every kernel call slow down my program?
Issuing a copy from GPU to CPU and then waiting for its completion will slow your program a bit. Note that whether you send 1 byte or 1 KB won't make much of a difference: in this case the problem is latency, not bandwidth.
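A simple cost model makes the latency-versus-bandwidth point concrete. The C++ sketch below is illustrative only: the function is hypothetical, and the 10 us latency and 10 GB/s bandwidth used in the note after it are assumed round numbers, not measurements of any particular GPU.

```cpp
// Simple cost model for one transfer: a fixed latency plus the time to move the payload.
double transfer_time_us(double bytes, double latency_us, double bandwidth_bytes_per_s) {
    return latency_us + bytes / bandwidth_bytes_per_s * 1e6;
}
```

With the assumed 10 us latency and 10 GB/s bandwidth, a 1-byte copy costs about 10.0001 us and a 1 KB copy about 10.1 us, so the payload size accounts for roughly 1% of the total.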
But launching a kernel does consume some time as well. If the "meat" of your algorithm is in the kernel itself I wouldn't spend too much time on that single, small transfer.
Do note that if you choose to use mapped memory instead of cudaMemcpy, you will need to explicitly place a cudaDeviceSynchronize (or cudaThreadSynchronize with older CUDA) barrier before reading the status, as opposed to the implicit barrier of cudaMemcpy. Otherwise, your host code may go ahead and read an old value stored in your pinned memory before the kernel overwrites it.