About thrust::execution_policy when copying data from device to host - c++

I use thrust::copy to transfer data from device to host in a multi-GPU system. Each GPU has an equally sized partition of the data. Using OpenMP, I call the function once per device. My current system has 4 GPUs.
#pragma omp parallel for
for (size_t i = 0; i < devices.size(); ++i)
{
    const int device = devices[i];
    thrust::copy(thrust::device,                              // execution policy
                 device_buffers->At(device)->begin(),         // thrust::device_vector
                 device_buffers->At(device)->end(),
                 elements->begin() + (device * block_size));  // thrust::host_vector
}
After reading the documentation and the following post, I understand that the default thrust::execution_policy is chosen based on the iterators that are passed.
When copying data from device to host, a device iterator and a host iterator are both passed as function parameters.
1. Which execution policy is picked by default here: thrust::host or thrust::device?
After doing some benchmarks, I observe that passing thrust::device
explicitly improves performance, compared to not passing an explicit
parameter.
2. What could be the reason for the performance gain? The system is a POWER9 machine. How do thrust::copy and the specific execution policy work internally? How many of the 4 copy engines of each device are actually used?
However, nvprof does not display the [CUDA memcpy DtoH] category
anymore and instead shows void thrust::cuda_cub::core [...]
__parallel_for::ParallelForAgent [...] which even shows an increase in Time (s). This does not make sense because, as I said, I observed
a consistent performance improvement (smaller total execution time)
when using thrust::device.
3. Is this just an nvprof + thrust-specific behaviour that causes profiling numbers not to correlate with actual execution time? I observed something similar for cudaFree: it seems that cudaFree returns control to the host code pretty quickly, which results in a small execution time, while nvprof shows much higher numbers because the actual deallocation probably happens in a lazy fashion.

The Thrust documentation on thrust::device states the following:
Raw pointers allocated by host APIs should not be mixed with a thrust::device algorithm invocation when the device backend is CUDA
To my understanding, this means that a host-device copy with the thrust::device execution policy is invalid in the first place, unless the host memory is pinned.
We assume that your host allocation is not pinned, BUT: one possibility is that on POWER9 with NVLINK you may be lucky in that any host-allocated memory is addressable from within the GPU. Thanks to that, a host-device copy with thrust::device works, even though it should not.
On a regular system, host memory is addressable from within a GPU only if it is allocated with cudaMallocHost (pinned). So the question is whether your POWER system has automagically upgraded all allocations to be pinned. Is the observed performance bonus due to the implicitly pinned memory, or would you get an additional speedup if the allocations were also done with cudaMallocHost explicitly?
Another hint from Thrust's design is that the thrust::device policy supports par.on(stream), while thrust::host does not. This is aligned with the fact that asynchronous host-device copies are only possible with pinned memory.
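If you want to experiment with that, one possible sketch is to pin the host destination explicitly and give Thrust a stream per GPU. This is only an illustration of the pinned-memory + par.on(stream) idea, not your actual code: the container types and names (gather_results, a plain std::vector of device_vectors, block_size) are simplifications, and whether Thrust actually turns this into an asynchronous DtoH memcpy is something you would have to confirm with a profiler:
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <vector>

// Copy one equally sized block per GPU into a pinned host buffer.
void gather_results(const std::vector<int>& devices,
                    std::vector<thrust::device_vector<float>>& device_buffers,
                    std::size_t block_size)
{
    float* pinned_elements = nullptr;  // pinned host destination
    cudaMallocHost((void**)&pinned_elements,
                   devices.size() * block_size * sizeof(float));

    #pragma omp parallel for
    for (std::size_t i = 0; i < devices.size(); ++i)
    {
        cudaSetDevice(devices[i]);
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // thrust::cuda::par.on(stream) runs the algorithm on a user-supplied stream;
        // the destination is a raw pointer into pinned host memory.
        thrust::copy(thrust::cuda::par.on(stream),
                     device_buffers[i].begin(),
                     device_buffers[i].end(),
                     pinned_elements + i * block_size);

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }
    // ... consume pinned_elements, then cudaFreeHost(pinned_elements);
}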

Related

concurrent data transfer cuda kernel and host [duplicate]

I have some questions.
I am currently writing a program using CUDA.
In my program, there is one big data structure on the host, implemented as std::map<string, vector<int>>.
Using this data, some vector<int>s are copied to the GPU's global memory and processed on the GPU.
After processing, some results are generated on the GPU and these results are copied back to the CPU.
This is my program's schedule:
1. cudaMemcpy( ... , cudaMemcpyHostToDevice)
2. kernel function (the kernel can only run once the necessary data has been copied to GPU global memory)
3. cudaMemcpy( ... , cudaMemcpyDeviceToHost)
Repeat steps 1~3 1000 times (for another data vector)
But I want to reduce the processing time, so I decided to use the cudaMemcpyAsync function in my program.
After searching some documents and web pages, I realized that to use cudaMemcpyAsync, the host memory holding the data to be copied to the GPU's global memory must be allocated as pinned memory.
But my program uses std::map, so I couldn't make the std::map data pinned.
So instead, I made a pinned-memory buffer array that is always large enough to handle any vector that needs to be copied.
Finally, my program worked like this:
1. memcpy (copy data from the std::map to the buffer in a loop, until the whole data set is in the buffer)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. kernel (the kernel can only be executed once the whole data set has been copied to GPU global memory)
4. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
Repeat steps 1~4 1000 times (for another data vector)
And my program became much faster than in the previous case.
But the problem (my curiosity) starts at this point.
I tried to make another program in a similar way:
1. memcpy (copy data from the std::map to the buffer, but only for one vector)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. Loop steps 1~2 until the whole data set has been copied to GPU global memory
4. kernel (the kernel can only be executed once the necessary data has been copied to GPU global memory)
5. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
Repeat steps 1~5 1000 times (for another data vector)
This method turned out to be about 10% faster than the method discussed above, but I don't know why.
I thought cudaMemcpyAsync could only be overlapped with a kernel function.
But in my case, that does not seem to be what is happening; rather, it looks like cudaMemcpyAsync calls can overlap with each other.
Sorry for the long question, but I really want to know why.
Can someone teach or explain to me what exactly cudaMemcpyAsync does and which functions can be overlapped with cudaMemcpyAsync?
The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.
As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:
Host memory buffers that are pinned, e.g. via cudaHostAlloc()
Usage of CUDA streams to separate the various types of activity (data copy and kernel computation)
Usage of cudaMemcpyAsync (instead of cudaMemcpy)
Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:
It can be issued in any stream (it takes a stream parameter)
Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.
Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
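Putting these pieces together, a minimal hypothetical sketch of the pattern might look like this (the kernel process_block, the chunk size, and the double-buffering scheme are all made up for illustration; error checking and cleanup are omitted):
#include <cuda_runtime.h>

__global__ void process_block(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder work
}

int main()
{
    const int num_chunks = 8;
    const int chunk = 1 << 20;                          // elements per chunk

    float *h_in = nullptr, *h_out = nullptr;            // pinned host buffers
    cudaHostAlloc((void**)&h_in,  num_chunks * chunk * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out, num_chunks * chunk * sizeof(float), cudaHostAllocDefault);
    // (filling h_in with real data is omitted)

    float *d_in = nullptr, *d_out = nullptr;            // double-buffered device storage
    cudaMalloc((void**)&d_in,  2 * chunk * sizeof(float));
    cudaMalloc((void**)&d_out, 2 * chunk * sizeof(float));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int c = 0; c < num_chunks; ++c)
    {
        int s = c % 2;                                  // alternate between the two streams
        cudaMemcpyAsync(d_in + s * chunk, h_in + c * chunk,
                        chunk * sizeof(float), cudaMemcpyHostToDevice, streams[s]);
        process_block<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            d_in + s * chunk, d_out + s * chunk, chunk);
        cudaMemcpyAsync(h_out + c * chunk, d_out + s * chunk,
                        chunk * sizeof(float), cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                            // wait for all streams to finish
    // cleanup (cudaFree / cudaFreeHost / cudaStreamDestroy) omitted for brevity
    return 0;
}
While chunk A is being processed by the kernel in one stream, chunk B's host-to-device copy can proceed in the other stream, which is exactly the overlap described above.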
Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:
Overlap copy/compute section of the CUDA best practices guide.
Sample code showing a basic implementation of copy/compute overlap.
Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.
Note that some of the above discussion is predicated on having a device of compute capability 2.0 or greater (e.g. for concurrent kernels). Also, different devices may have one or two copy engines, meaning that simultaneous copy to the device and copy from the device is only possible on certain devices.

Apparent CUDA magic

I'm using CUDA (in reality I'm using pyCUDA if the difference matters) and performing some computation over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I'm declaring two linear arrays of 20000 components using:
float test[20000];
float test2[20000];
With these arrays I perform simple calculations, like for example filling them with constant values. The point is that the kernel executes normally and performs the computations correctly (you can verify this by filling an array with a random component of test and sending that array from device to host).
The problem is that my NVIDIA card has only 2GB of memory, while the total amount of memory needed to allocate the arrays test and test2 is 320*600*20000*4 bytes, which is much more than 2GB.
Where is this memory coming from? And how can CUDA perform the computation in every thread?
Thank you for your time
The actual sizing of the local/stack memory requirement is not what you suppose (the entire grid, all at once) but is based on a formula described by @njuffa here.
Basically, the local/stack memory requirement is sized based on the maximum instantaneous capacity of the device you are running on, rather than on the size of the grid.
Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
available GPU memory/(#of SMs)/(max threads per SM)
For your first case:
float test[20000];
float test2[20000];
That total is 160KB (per thread) so we are under the maximum limit of 512KB per thread. What about the 2nd limit?
GTX 650m has 2 cc 3.0 (kepler) SMs (each Kepler SM has 192 cores). Therefore, the second limit above gives, if all the GPU memory were available:
2GB/2/2048 = 512KB
(kepler has 2048 max threads per multiprocessor)
so it is the same limit in this case. But this assumes all the GPU memory is available.
Since you're suggesting in the comments that this configuration fails:
float test[40000];
float test2[40000];
i.e. 320KB, I would conclude that your actual available GPU memory at the point of this bulk allocation attempt is somewhere above (160/512)*100%, i.e. above 31%, but below (320/512)*100%, i.e. below 62.5%, of 2GB. So I would conclude that your available GPU memory at the time of this bulk allocation request for the stack frame is something less than 1.25GB.
You could try to see if this is the case by calling cudaMemGetInfo right before the kernel launch in question (although I don't know how to do this in pycuda). Even though your GPU starts out with 2GB, if you are running the display from it, you are likely starting with a number closer to 1.5GB. And dynamic (e.g. cudaMalloc) and/or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch will all impact available memory.
This is all to explain some of the specifics. The general answer to your question is that the "magic" arises due to the fact that the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be a number that is significantly less than what would be required for the whole grid.
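As a hypothetical illustration (in CUDA C++; I don't know the pycuda equivalent either), the second limit could be estimated right before the kernel launch like this:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);    // memory currently available on the device

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Second limit from the answer: available memory / #SMs / max threads per SM
    size_t per_thread = free_bytes / prop.multiProcessorCount
                                   / prop.maxThreadsPerMultiProcessor;

    printf("free: %zu MB, SMs: %d, max threads/SM: %d -> ~%zu KB per thread\n",
           free_bytes >> 20, prop.multiProcessorCount,
           prop.maxThreadsPerMultiProcessor, per_thread >> 10);
    return 0;
}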

CUDA boolean variable for host loop

A very simplified version of my code looks like:
do {
    // reset loop variable b to 0/false
    b = 0;
    // execute kernel
    kernel<<<...>>>(b);
    // use the value of b for the while condition
} while (b);
The boolean variable b can be set to true by any thread in the kernel, and it tells us whether to continue running our loop.
Using cudaMalloc, cudaMemset, and cudaMemcpy we can create/set/copy device memory to implement this. However, I recently discovered pinned memory. Using cudaMallocHost to allocate b and calling cudaDeviceSynchronize right after the kernel gave quite a speed-up (~50%) in a simple test program.
Is pinned memory the best option for this boolean variable b or is there a better option?
You haven't shown your initial code or the modified code, so nobody can know the details of the improvement you are reporting in your post.
The answer to your question varies depending on:
Whether b is only written, or both read and written, inside the GPU kernel. Reads might need to fetch the actual value directly from the host side if b is not found in the cache, resulting in latency. On the other hand, the latency of writes can be hidden if there is further work that keeps the threads busy.
How frequently you modify the value. If you access the value frequently in your program, the GPU can probably keep the variable in L2, avoiding host-side accesses.
The number of memory operations between accesses to b. If there are many memory transactions between accesses to b, it is more likely that b is evicted from the cache and replaced with other content. As a result, when b is accessed again, it will not be found in the cache and a time-consuming host access is necessary.
In cases where having b on the host side causes many host memory transactions, it is logical to keep it in GPU global memory and transfer it back at the end of each loop iteration. You can do this rather quickly with an asynchronous copy in the same stream as the kernel's and synchronize with the host right after, as sketched below.
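A hypothetical sketch of that device-resident flag (the kernel body, grid size, and flag semantics are placeholders; error checking omitted):
#include <cuda_runtime.h>

__global__ void kernel(int* d_flag)
{
    // ... real work here; any thread that wants another loop iteration would do:
    // atomicExch(d_flag, 1);
}

int main()
{
    int *d_flag = nullptr;
    int *h_flag = nullptr;
    cudaMalloc((void**)&d_flag, sizeof(int));
    cudaMallocHost((void**)&h_flag, sizeof(int));  // pinned, so the copy can be asynchronous

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    do {
        cudaMemsetAsync(d_flag, 0, sizeof(int), stream);   // reset b on the device
        kernel<<<128, 256, 0, stream>>>(d_flag);
        cudaMemcpyAsync(h_flag, d_flag, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);   // same stream as the kernel
        cudaStreamSynchronize(stream);                     // wait before reading the flag
    } while (*h_flag);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_flag);
    cudaFree(d_flag);
    return 0;
}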
All of the above items apply to cache-enabled devices. If your device is pre-Fermi (CC < 2.0), the story is different.

Is the pointer in c a normal variable or more than this

I'm really confused. When I write
int *ptr;
is this just a normal variable that holds the address of another variable, or is it a complex thing located in CPU registers for direct access?
I need a clear answer: is a pointer a variable or not?
A pointer is a variable, in that it can be changed to point to other instances of the same data type.
In most processors, it is represented as an address within the processor's full address range.
The code generated by the compiler may emit code to load the value of the pointer variable into a register, then emit code to operate on the register. One operation would be to dereference the pointer. In other words, the compiler emits code to load a register with the value at the address represented by the pointer. This is also known as indirection.
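For example, in plain C++ (purely illustrative):
#include <cstdio>

int main()
{
    int value = 42;
    int *ptr = &value;   // ptr is itself a variable: it holds the address of value

    *ptr = 7;            // indirection: fetch the address from ptr, then write to it
    printf("%d at %p\n", value, (void*)ptr);

    int other = 3;
    ptr = &other;        // the pointer variable can be changed to point elsewhere
    return 0;
}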
Although direct access is faster than indirect access, the difference in execution time is usually negligible. For example, if a direct access takes 50 nanoseconds and an indirect access takes 60 nanoseconds, the difference is 10 nanoseconds. Your program would need to perform 100,000 or more indirections to make a noticeable time difference. There are special cases where this kind of optimization is necessary, but not for most applications. The time wasted waiting for user input or I/O from a hard drive makes the time difference between direct and indirect memory access insignificant.
The fastest "variable" accesses are listed in order:
Processor Register
Direct fetching from the data cache.
Direct fetching from memory on the chip, but outside the CPU core.
Indirect fetching from memory on the chip, but outside the CPU core.
Direct fetching from memory off of the System On a Chip.
Indirect fetching from memory off of the System On a Chip.
Fetching data from an I/O port.
If you think that indirection is still a concern, profile your code. For high accuracy:
Find a test point (TP) on the hardware, an LED, or some other place you can connect an oscilloscope probe to.
Assert the test point.
Perform the operations for at least 100,000 iterations.
Deassert the test point.
Measure the width of the pulse shown by the oscilloscope.
Another method is to read the system clock, perform 1E9 iterations, read the clock again, and subtract the two clock readings, as in the sketch below.
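A rough sketch of that clock-based method using std::chrono (the loop body is just a stand-in for the access you want to measure; volatile keeps the compiler from optimizing the loop away):
#include <chrono>
#include <cstdio>

int main()
{
    volatile int value = 0;
    volatile int *ptr = &value;
    const long long iterations = 1000000000LL;   // 1E9

    auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < iterations; ++i)
        *ptr = (int)i;                           // the indirect access being measured
    auto end = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = end - start;
    printf("total: %f s, per access: %g ns\n",
           elapsed.count(), elapsed.count() * 1e9 / iterations);
    return 0;
}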

is using cudaHostAlloc good for my case

I have a kernel that is launched several times, until a solution is found. The solution will be found by at least one block.
Therefore, when a block finds the solution, it should inform the CPU that the solution has been found, so the CPU can print the solution provided by this block.
So what I am currently doing is the following:
__global__ void kernel(int *sol)
{
    // do some computations
    if (/* the block found a solution */)
        *sol = blockIdx.x; // atomically
}
Now, on every call to the kernel, I copy sol back to host memory and check its value. If it is set to 3, for example, I know that block 3 found the solution, so I now know where the index of the solution starts, and I copy the solution back to the host.
In this case, would using cudaHostAlloc be a better option? Moreover, does copying the value of a single integer on every kernel call slow down my program?
Issuing a copy from GPU to CPU and then waiting for its completion will slow your program a bit. Note that if you choose to send 1 byte or 1KB, that won't make much of a difference. In this case bandwidth is not a problem, but latency.
But launching a kernel does consume some time as well. If the "meat" of your algorithm is in the kernel itself I wouldn't spend too much time on that single, small transfer.
Do note that if you choose to use mapped memory instead of cudaMemcpy, you will need to explicitly put a cudaDeviceSynchronize (or cudaThreadSynchronize with older CUDA) barrier (as opposed to the implicit barrier of cudaMemcpy) before reading the status. Otherwise, your host code may go ahead and read an old value stored in your pinned memory before the kernel overwrites it.
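For completeness, a hypothetical sketch of the mapped-memory variant described above (the kernel body, grid dimensions, and sentinel value are made up; error checking omitted):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(int *sol)
{
    // Placeholder "search": pretend block 3 finds the solution
    if (blockIdx.x == 3 && threadIdx.x == 0)
        atomicExch(sol, (int)blockIdx.x);
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // needed on older setups for mapped memory

    int *h_sol = nullptr, *d_sol = nullptr;
    cudaHostAlloc((void**)&h_sol, sizeof(int), cudaHostAllocMapped); // pinned + mapped
    cudaHostGetDevicePointer((void**)&d_sol, h_sol, 0);              // device-side alias

    *h_sol = -1;                             // sentinel: no block has found a solution yet
    while (*h_sol < 0) {
        kernel<<<128, 256>>>(d_sol);
        cudaDeviceSynchronize();             // must sync before reading the mapped value
    }
    printf("solution found by block %d\n", *h_sol);

    cudaFreeHost(h_sol);
    return 0;
}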