So far I have written programs where a kernel is called only once in the program. So I have a kernel:
__global__ void someKernel(float *d_in) { // any parameters
    // some operation
}
and I basically do
int main()
{
    //create an array in device memory
    cudaMalloc(......);
    //move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    //call the kernel
    someKernel<<<nblocks, 512>>>(.......);
    //copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // Point to notice HERE
}
It works fine. However, this time I want to call the kernel not just once but many times. Something like:
int main()
{
    //create an array in device memory
    cudaMalloc(......);
    //move host data to that array
    cudaMemcpy(......, cudaMemcpyHostToDevice);
    //call the kernel
    someKernel<<<nblocks, 512>>>(.......);
    //copy results to host memory
    cudaMemcpy(......, cudaMemcpyDeviceToHost);
    // From here
    //Some unrelated calculations here
    dothis();
    dothat();
    //Then call the kernel again, repeatedly
    for (auto k : some_ks)
    {
        // Do some pre-calculations
        //call the kernel
        someKernel<<<nblocks, 512>>>(.......);
        // some post-calculations
    }
}
My question is: should I use some kind of synchronization between calling the kernel the first time and calling the kernel in the for loop (and between iterations)? Perhaps cudaDeviceSynchronize or something else, or is it not necessary?
Additional synchronization would not be necessary in this case, for at least two reasons:
cudaMemcpy is a synchronizing call already. It blocks the CPU thread and waits until all previous CUDA activity issued to that device is complete, before it allows the data transfer to begin. Once the data transfer is complete, the CPU thread is allowed to proceed.
CUDA work issued to a single device will not overlap in any way unless you use CUDA streams. You are not using streams, so everything is issued to the default stream, and even asynchronous work issued to the device will execute in issue order. Items A and B issued to the device in that order will not overlap with each other: item A will complete before item B is allowed to begin. This is a fundamental point of CUDA stream semantics.
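As an illustration only (not code from the question), here is a minimal sketch of that pattern; N, nblocks, num_iterations and the host buffer h_data are placeholders, and someKernel has the signature shown in the question:

float *d_in;
cudaMalloc(&d_in, N * sizeof(float));
cudaMemcpy(d_in, h_data, N * sizeof(float), cudaMemcpyHostToDevice); // blocks until the copy completes

someKernel<<<nblocks, 512>>>(d_in);          // returns immediately, queued in the default stream

for (int k = 0; k < num_iterations; k++) {
    // host-side pre-calculations
    someKernel<<<nblocks, 512>>>(d_in);      // queued behind the previous launch; runs in issue order
    // host-side post-calculations that do not read device results
}

// this copy waits for all previously issued device work to complete
cudaMemcpy(h_data, d_in, N * sizeof(float), cudaMemcpyDeviceToHost);

In this pattern the cudaMemcpy calls already provide all the host-device synchronization that is required.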
I have some questions.
Recently I have been writing a program using CUDA.
In my program, there is one big piece of data on the host, stored in a std::map<string, vector<int>>.
Using this data, some vector<int>s are copied to GPU global memory and processed on the GPU.
After processing, some results are generated on the GPU, and these results are copied back to the CPU.
This is my program's schedule:
1. cudaMemcpy( ... , cudaMemcpyHostToDevice)
2. kernel function (the kernel can only run once the necessary data has been copied to GPU global memory)
3. cudaMemcpy( ... , cudaMemcpyDeviceToHost)
Repeat steps 1-3 1000 times (for the other data (vectors)).
But I want to reduce the processing time, so I decided to use the cudaMemcpyAsync function in my program.
After searching some documents and web pages, I realized that to use cudaMemcpyAsync, the host memory holding the data to be copied to GPU global memory must be allocated as pinned memory.
But my program uses std::map, so I couldn't make the std::map data itself pinned memory.
So instead, I made a buffer array in pinned memory, and this buffer can always handle every case of copying a vector.
Finally, my program worked like this:
1. Memcpy (copy data from the std::map to the buffer in a loop until the whole data set has been copied to the buffer)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. kernel (the kernel can only execute once the whole data set has been copied to GPU global memory)
4. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
Repeat steps 1-4 1000 times (for the other data (vectors)).
And my program became much faster than in the previous case.
But the problem (really, my curiosity) is this point. I tried to make another program in a similar way:
1. Memcpy (copy data from the std::map to the buffer, but only one vector at a time)
2. cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
3. Loop steps 1-2 until the whole data set has been copied to GPU global memory
4. kernel (the kernel can only execute once the necessary data has been copied to GPU global memory)
5. cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
Repeat steps 1-5 1000 times (for the other data (vectors)).
This method came out to be about 10% faster than the method discussed above, but I don't know why.
I thought cudaMemcpyAsync could only be overlapped with a kernel function.
But in my case that does not seem to be what is happening. Rather, it looks like cudaMemcpyAsync calls can be overlapped with each other.
Sorry for my long question, but I really want to know why.
Can someone teach or explain to me what exactly the facility "cudaMemcpyAsync" is, and which functions can be overlapped with "cudaMemcpyAsync"?
The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.
As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:
Host memory buffers that are pinned, e.g. via cudaHostAlloc()
Usage of CUDA streams to separate various types of activity (data copy and kernel computation)
Usage of cudaMemcpyAsync (instead of cudaMemcpy)
Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:
It can be issued in any stream (it takes a stream parameter)
Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.
Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
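As a rough, hedged sketch of these ideas (not taken from the samples referenced below), splitting the work into NCHUNKS independent pieces could look like this; the kernel body, chunk size and grid size are placeholders:

__global__ void someKernel(float *d_chunk) { /* process one chunk starting at d_chunk */ }

const int    NCHUNKS     = 4;
const size_t CHUNK_SIZE  = 1 << 20;                  // elements per chunk (placeholder)
const size_t CHUNK_BYTES = CHUNK_SIZE * sizeof(float);
const int    nblocks     = (CHUNK_SIZE + 511) / 512;

float *h_buf;                                        // pinned host memory
cudaHostAlloc((void **)&h_buf, NCHUNKS * CHUNK_BYTES, cudaHostAllocDefault);
float *d_buf;
cudaMalloc(&d_buf, NCHUNKS * CHUNK_BYTES);

cudaStream_t streams[NCHUNKS];
for (int i = 0; i < NCHUNKS; i++)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < NCHUNKS; i++) {
    float *h = h_buf + i * CHUNK_SIZE;
    float *d = d_buf + i * CHUNK_SIZE;
    // the copy of chunk i can overlap with the kernel working on chunk i-1
    cudaMemcpyAsync(d, h, CHUNK_BYTES, cudaMemcpyHostToDevice, streams[i]);
    someKernel<<<nblocks, 512, 0, streams[i]>>>(d);
    cudaMemcpyAsync(h, d, CHUNK_BYTES, cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();                             // wait for all streams to drain

Within each stream the copy-kernel-copy sequence still executes in order; the overlap happens between different streams.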
Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:
Overlap copy/compute section of the CUDA best practices guide.
Sample code showing a basic implementation of copy/compute overlap.
Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.
Note that some of the above discussion is predicated on having a device of compute capability 2.0 or greater (e.g. for concurrent kernels). Also, different devices may have one or two copy engines, meaning simultaneous copy to the device and copy from the device is only possible on certain devices.
I am working on a library that does dynamic workload distribution for the solution of a differential equation using CUDA and MPI. I have a number of nodes that each have an NVIDIA GPU. Each node also has multiple processes, of course. The equation takes a certain number of inputs (6 in this example) and builds a solution that is represented as an array in global memory on the GPU.
My current strategy is to allocate the input data buffer on the root process on each node:
if (node_info.is_node_root_process)
{
cudaMalloc(&gpu_input_buffer.u_buffer, totalsize);
cudaMalloc(&gpu_input_buffer.v_buffer, totalsize);
}
Then, I want each process to individually call cudaMemcpy to copy its input data into GPU global memory, each to a different location in this input buffer. This way, the input buffer is contiguous in memory, and it is possible to achieve memory coalescing.
I understand that when calling cudaMemcpy from multiple processes (or threads), the calls will be executed serially on the device. This is fine.
What I want to do is share the address that e.g. gpu_input_buffer.u_buffer points to with each process. This way, each process possesses an offset process_gpu_io_offset such that the data relevant to that process is simply gpu_input_buffer.u_buffer + process_gpu_io_offset to gpu_input_buffer.u_buffer + process_gpu_io_offset + number_of_points - 1.
I have read that it is taboo to share pointer values via MPI since virtual addressing is used, but since all the GPU data resides in a single memory space and since gpu_input_buffer.u_buffer is a device pointer, I think this should be fine.
Is this a reliable way to implement what I want?
EDIT: Based on the CUDA documentation:
Any device memory pointer or event handle created by a host thread can be directly referenced by any other thread within the same process. It is not valid outside this process however, and therefore cannot be directly referenced by threads belonging to a different process.
This means my original approach is invalid. As has been pointed out, the CUDA API has IPC memory handles for this purpose, but I cannot find any information about how to share this using MPI. The documentation for cudaIpcMemHandle_t is just:
CUDA IPC memory handle
which does not give any information in support of what I need to do. It is possible to create an MPI derived type and communicate that, but this requires that I know the members of cudaIpcMemHandle_t, which I do not.
The CUDA Runtime API has specific support for sharing memory regions (and events) between processes on the same machine. Just use that!
Here are example snippets (using my modern-C++ wrappers for the CUDA Runtime API):
Main process:
auto buffer = cuda::memory::device::make_unique<unsigned char[]>(totalsize);
gpu_input_buffer.u_buffer = buffer.get(); // because it's a smart pointer
auto handle_to_share = cuda::memory::ipc::export_(gpu_input_buffer.u_buffer);
do_some_MPI_magic_here_to_share_the_handle(handle_to_share);
Other processes:
auto shared_buffer_handle = do_some_MPI_magic_here_to_get_the_shared_handle();
auto full_raw_buffer = cuda::memory::ipc::import<unsigned char>(shared_buffer_handle);
auto my_part_of_the_raw_buffer = full_raw_buffer + process_gpu_io_offset;
Note: If you're very curious about exact layout of the handle type, here's an excerpt from CUDA's driver_types.h:
typedef __device_builtin__ struct __device_builtin__ cudaIpcMemHandle_st
{
char reserved[CUDA_IPC_HANDLE_SIZE];
} cudaIpcMemHandle_t;
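For reference, here is a hedged sketch of the same flow using the raw CUDA Runtime API and plain MPI calls instead of the wrappers above (assuming the usual <cuda_runtime.h> and <mpi.h> headers; node_comm and node_root_rank are placeholders, the other names come from the question):

cudaIpcMemHandle_t handle;

if (node_info.is_node_root_process)
{
    cudaMalloc(&gpu_input_buffer.u_buffer, totalsize);
    cudaIpcGetMemHandle(&handle, gpu_input_buffer.u_buffer);
}

// The handle is an opaque, fixed-size struct, so it can be broadcast as raw bytes;
// no MPI derived type is needed.
MPI_Bcast(&handle, sizeof(cudaIpcMemHandle_t), MPI_BYTE, node_root_rank, node_comm);

if (!node_info.is_node_root_process)
{
    void *mapped = nullptr;
    cudaIpcOpenMemHandle(&mapped, handle, cudaIpcMemLazyEnablePeerAccess);
    unsigned char *my_part = static_cast<unsigned char *>(mapped) + process_gpu_io_offset;
    // ... this process copies its own inputs to my_part ...
    cudaIpcCloseMemHandle(mapped);   // when finished with the region
}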
I am using the OpenCL 1.2 C++ wrapper for my project. I want to know what the correct method of calling my kernel is. In my case, I have 2 devices, and the data should be sent to them simultaneously.
I am dividing my data into two chunks, and both devices should be able to perform computations on them separately. They have no interconnection and don't need to know what is happening on the other device.
When the data has been sent to both devices, I want to wait for the kernels to finish before my program goes further, because I will be using the results returned from both kernels. So I don't want to start reading the data back before the kernels have returned.
I have 2 methods. Which one is programmatically correct in my case:
Method 1:
for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i], /* NDRange, etc. */),
             /* arguments etc... */);
    queue[i].flush();
}
// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Method 2:
for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i], /* NDRange, etc. */),
             /* arguments etc... */);
}
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].flush();
}
// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Or is neither of them correct, and is there a better way to wait for my kernels to return?
Assuming each device computes in its own memory:
I would go for a multi-threaded (for-loop) version of your Method 1, because OpenCL doesn't force vendors to do asynchronous enqueueing. NVIDIA, for example, does synchronous enqueueing for some drivers and hardware, while AMD does asynchronous enqueueing.
When each device is driven by a separate thread, they should enqueue write+compute together before synchronizing for reading partial results (the second threaded loop).
Having multiple threads is also advantageous for spin-wait type synchronization (clFinish), because multiple spin-wait loops run in parallel. This should save time on the order of a millisecond.
A flush helps some vendors, such as AMD, start executing the enqueued commands early.
To have correct input and correct output for all devices, only two finish commands are enough: one after write+compute, then one after read (results). That way each device gets the same time step's data and produces results for the same time step. Write and compute don't need a finish between them if the queue type is in-order, because it executes them one by one. The read operations don't need to be blocking either.
Unnecessary finish commands always kill performance.
Note: I already wrote a load balancer using all of this, and it performs better when using event-based synchronization instead of finish (see the sketch at the end of this answer). Finish is easier, but it has larger synchronization overhead than an event-based approach.
Also, a single queue doesn't always push a GPU to its limits. Using at least 4 queues per device ensures latency hiding of write and compute on my AMD system; sometimes even 16 queues help a bit more, but for I/O-bottlenecked situations you may need even more.
Example:
Thread 1:
    Write
    Compute
    Synchronize with the other thread
Thread 2:
    Write
    Compute
    Synchronize with the other thread
Thread 1:
    Read
    Synchronize with the other thread
Thread 2:
    Read
    Synchronize with the other thread
Unnecessary synchronization kills performance, because drivers don't know your intention and leave it as it is. So you should eliminate unneeded finish commands and convert blocking writes to non-blocking ones where you can.
Zero synchronization is also wrong, because OpenCL doesn't force vendors to start computing after several enqueues: the queued work may grow indefinitely, to gigabytes of memory, in minutes or even seconds.
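To illustrate the event-based alternative mentioned in the note above, here is a hedged sketch built on the questioner's Method 1; it assumes kernelGA is a cl::make_kernel-style functor (whose call returns a cl::Event, as in cl.hpp), that <vector> is included, and that globalSize is a placeholder:

std::vector<cl::Event> done(numberOfDevices);

for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue on device i's queue; the functor returns an event for this launch.
    done[i] = kernelGA(cl::EnqueueArgs(queue[i], cl::NDRange(globalSize)),
                       /* arguments etc... */);
    queue[i].flush();   // make sure the command is actually issued to the device
}

// Wait only on the launch events instead of finishing whole queues.
for (int i = 0; i < numberOfDevices; i++) {
    done[i].wait();
}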
You should use Method 1. clFlush is the only way of guaranteeing that commands are issued to the device (and not just buffered somewhere before sending).
I am trying to run the following OpenCL code. In the kernel function, I define an array int arr[1000] = {0}:
kernel void test()
{
int arr[1000] = {0};
}
Then I will create N threads to run the kernel.
cl::CommandQueue cmdQueue;
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange); // kernel here is the one running test()
My question is: since we know that OpenCL runs the threads in parallel, does that mean the peak memory usage will be N * 1000 * sizeof(int)?
This is not the way to OpenCL (yes, that's what I meant :).
The kernel function operates on kernel operands passed in from the host (CPU), so you'd allocate your array on the host using clCreateBuffer and set the argument using clSetKernelArg. Your kernel does not declare/allocate the device memory, but simply receives it as a __global argument. Now when you run the kernel using clEnqueueNDRangeKernel with a global size of 1000, one work-item (device thread) runs on each of those 1000 ints.
If, on the other hand, you meant to allocate 1000 ints per work-item (device thread), your calculation is right (yes, each work-item's private array costs memory), but it probably won't work: the per-work-item memory available on a device (see here for how to check this for your device) is severely limited.
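For reference, here is a hedged sketch of the host-managed-buffer approach described above, in the same C++ wrapper style as the question (the context, program and command queue are assumed to have been created already; N is the number of work-items):

// Kernel source: each work-item touches one element of a __global array.
//     kernel void test(global int *arr) {
//         size_t i = get_global_id(0);
//         arr[i] = 0;
//     }

cl::Buffer arrBuf(context, CL_MEM_READ_WRITE, N * sizeof(int));  // one host-created buffer of N ints
cl::Kernel kernel(program, "test");
kernel.setArg(0, arrBuf);                                        // passed to the kernel as its __global argument
cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(N), cl::NullRange);

The total device memory used for the array is N * sizeof(int), regardless of how many work-items run.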
My algorithm consists of two steps:
Data generation. In this step I generate the data array in a loop, as the result of some function.
Data processing. For this step I have written an OpenCL kernel which processes the data array generated in the previous step.
Currently the first step runs on the CPU because it is hard to parallelize. I want to run it on the GPU, because each generation step takes some time, and I want to run the second step on the already generated data immediately.
Can I launch another OpenCL kernel from the currently running kernel, in a separate thread? Or would it run in the same thread as the caller kernel?
Some pseudocode to illustrate my point:
__kernel void second(__global int *data, int index) {
    // work on data[index]. This process takes a lot of time.
}

__kernel void first(__global int *data, const int length) {
    for (int i = 0; i < length; i++) {
        // generate data and store it in data[i]
        // Will this kernel be launched in the same thread as the caller, or in a new thread?
        // If in the same thread, is there a way to launch it in a separate thread?
        second(data, i);
    }
}
No. OpenCL has no concept of threads in that sense, and a kernel execution cannot launch another kernel. All kernel executions are triggered by the CPU.
You should launch one kernel.
Then do a clFinish();
Then execute the next kernel.
There are more efficient ways, but they involve events and would only complicate things here.
You just use the memory output of the first kernel as the input for the second one. With that, you avoid the extra CPU<->GPU copies.
I believe that the global work size might be considered as the number of threads that will be executed, in one way or another. Correct me if I'm wrong.
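To make the suggested sequence concrete, here is a hedged sketch using the C API; the context, queue and kernels are assumed to already exist, and first_kernel/second_kernel are assumed to be range-style variants of the kernels in the question (one work-item per element) rather than the exact signatures shown there:

cl_int err;
cl_int length = 100000;                        // placeholder problem size
size_t global_size = (size_t)length;

cl_mem data_buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                 length * sizeof(cl_int), NULL, &err);

// Step 1: generate the data on the GPU.
clSetKernelArg(first_kernel, 0, sizeof(cl_mem), &data_buf);
clEnqueueNDRangeKernel(queue, first_kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
clFinish(queue);                               // make sure generation has completed

// Step 2: process the same device buffer; nothing is copied back to the host in between.
clSetKernelArg(second_kernel, 0, sizeof(cl_mem), &data_buf);
clEnqueueNDRangeKernel(queue, second_kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
clFinish(queue);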