I met some problems When using functions in thrust library, I am not sure if I should add cudaDeviceSynchronize manually before it. For example,
double dt = 0;
kernel_1<<<blocks, threads>>>(it);
dt = *(thrust::max_element(it, it + 10));
printf("%f\n", dt);
Since kernel_1 is non-blocking, host will execute the next statement. The problem is I am not sure if the thrust::max_element is blocking. If it is blocking, then it works well; otherwise, will host skip it and execute the "printf" statement?
Thanks
Your code is broken in at least 2 ways.
it is presumably a device pointer:
kernel_1<<<blocks, threads>>>(it);
^^
it is not allowed to use a raw device pointer as an argument to a thrust algorithm:
dt = *(thrust::max_element(it, it + 10));
^^
unless you wrap that pointer in a thrust::device_ptr or else use the thrust::device execution policy explicitly as an argument to the algorithm. Without any of these clues, thrust will dispatch the host code path (which will probably seg fault) as discussed in the thrust quick start guide.
If you fixed the above item using either thrust::device_ptr or thrust::device, then thrust::max_element will return an iterator of a type consistent with the iterators passed to it. If you pass a thrust::device_ptr it will return a thrust::device_ptr. If you use thrust::device with your raw pointer, it will return a raw pointer. In either case, it is illegal to dereference such in host code:
dt = *(thrust::max_element(it, it + 10));
^
again, I would expect such usage to seg fault.
Regarding asynchrony, it is safe to assume that all thrust algorithms that return a scalar quantity stored in stack variable are synchronous. That means the CPU thread will not proceed beyond the thrust call until the stack variable has been populated with the correct value
Regarding GPU activity in general, unless you use streams, all GPU activity is issued to the same (default) stream. This means that all CUDA activity will be executed in-order, and a given CUDA operation will not begin until the preceding CUDA activity is complete. Therefore, even though your kernel launch is asynchronous, and the CPU thread will proceed onto the thrust::max_element call, any CUDA activity spawned from that call will not begin executing until the previous kernel launch is complete. Therefore, any changes made to the data referenced by it by kernel_1 should be finished and completely valid before any CUDA processing in thrust::max_element begins. And as we've seen, thrust::max_element itself will insert synchronization.
So once you fix the defects in your code, there should not be any requirement to insert additional synchronization anywhere.
This function does not seem to be async.
Both of these pages explain the behaviour of max_element() and they do not explicit it as async, so I would assume it is blocking :
thrust : Extrema
max_element
Since it is using an iterator to treat all the elements and find the maximum of the values, I can not think about it to be async.
You can still use cudaDeviceSynchronize to try it for real, but do not forget to set the corresponding flag on your device.
Related
I have some questions.
Recently I'm making a program by using CUDA.
In my program, there is one big data on Host programmed with std::map(string, vector(int)).
By using these datas some vector(int) are copied to GPUs global memory and processed on GPU
After processing, some results are generated on GPU and these results are copied to CPU.
These are all my program schedule.
cudaMemcpy( ... , cudaMemcpyHostToDevice)
kernel function(kernel function only can be done when necessary data is copied to GPU global memory)
cudaMemcpy( ... , cudaMemcpyDeviceToHost)
repeat 1~3steps 1000times (for another data(vector) )
But I want to reduce processing time.
So I decided to use cudaMemcpyAsync function in my program.
After searching some documents and web pages, I realize that to use cudaMemcpyAsync function host memory which has data to be copied to GPUs global memory must be allocated as pinned memory.
But my programs are using std::map, so I couldn't make this std::map data to pinned memory.
So instead of using this, I made a buffer array typed pinned memory and this buffer can always handle all the case of copying vector.
Finally, my program worked like this.
Memcpy (copy data from std::map to buffer using loop until whole data is copied to buffer)
cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
kernel(kernel function only can be executed when whole data is copied to GPU global memory)
cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
repeat 1~4steps 1000times (for another data(vector) )
And my program became much faster than the previous case.
But problem(my curiosity) is at this point.
I tried to make another program in a similar way.
Memcpy (copy data from std::map to buffer only for one vector)
cudaMemcpyAsync( ... , cudaMemcpyHostToDevice)
loop 1~2 until whole data is copied to GPU global memory
kernel(kernel function only can be executed when necessary data is copied to GPU global memory)
cudaMemcpyAsync( ... , cudaMemcpyDeviceToHost)
repeat 1~5steps 1000times (for another data(vector) )
This method came out to be about 10% faster than the method discussed above.
But I don't know why.
I think cudaMemcpyAsync only can be overlapped with kernel function.
But my case I think it is not. Rather than it looks like can be overlapped between cudaMemcpyAsync functions.
Sorry for my long question but I really want to know why.
Can Someone teach or explain to me what is the exact facility "cudaMemcpyAsync" and what functions can be overlapped with "cudaMemcpyAsync" ?
The copying activity of cudaMemcpyAsync (as well as kernel activity) can be overlapped with any host code. Furthermore, data copy to and from the device (via cudaMemcpyAsync) can be overlapped with kernel activity. All 3 activities: host activity, data copy activity, and kernel activity, can be done asynchronously to each other, and can overlap each other.
As you have seen and demonstrated, host activity and data copy or kernel activity can be overlapped with each other in a relatively straightforward fashion: kernel launches return immediately to the host, as does cudaMemcpyAsync. However, to get best overlap opportunities between data copy and kernel activity, it's necessary to use some additional concepts. For best overlap opportunities, we need:
Host memory buffers that are pinned, e.g. via cudaHostAlloc()
Usage of cuda streams to separate various types of activity (data copy and kernel computation)
Usage of cudaMemcpyAsync (instead of cudaMemcpy)
Naturally your work also needs to be broken up in a separable way. This normally means that if your kernel is performing a specific function, you may need multiple invocations of this kernel so that each invocation can be working on a separate piece of data. This allows us to copy data block B to the device while the first kernel invocation is working on data block A, for example. In so doing we have the opportunity to overlap the copy of data block B with the kernel processing of data block A.
The main differences with cudaMemcpyAsync (as compared to cudaMemcpy) are that:
It can be issued in any stream (it takes a stream parameter)
Normally, it returns control to the host immediately (just like a kernel call does) rather than waiting for the data copy to be completed.
Item 1 is a necessary feature so that data copy can be overlapped with kernel computation. Item 2 is a necessary feature so that data copy can be overlapped with host activity.
Although the concepts of copy/compute overlap are pretty straightforward, in practice the implementation requires some work. For additional references, please refer to:
Overlap copy/compute section of the CUDA best practices guide.
Sample code showing a basic implementation of copy/compute overlap.
Sample code showing a full multi/concurrent kernel copy/compute overlap scenario.
Note that some of the above discussion is predicated on having a compute capability 2.0 or greater device (e.g. concurrent kernels). Also, different devices may have one or 2 copy engines, meaning simultaneous copy to the device and copy from the device is only possible on certain devices.
I use thrust::copy to transfer data from device to host in a multi-GPU system. Each GPU has a equally sized partition of the data. Using OpenMP, I call the function on each device. On my current system I am working on 4 GPUs.
#pragma omp parallel for
for (size_t i = 0; i < devices.size(); ++i)
{
const int device = devices[i];
thrust::copy(thrust::device, // execution policy
device_buffers->At(device)->begin(), // thrust::device_vector
device_buffers->At(device)->end(),
elements->begin() + (device * block_size)); // thrust::host_vector
}
After reading the documentation and the following post, I understand that the default thrust::execution_policy is chosen based on the iterators that are passed.
When copying data from device to host, both iterators are passed as
function parameters.
1. Which execution policy is picked here per default? thrust::host or thrust::device?
After doing some benchmarks, I observe that passing thrust::device
explicitly improves performance, compared to not passing an explicit
parameter.
2. What could be the reason for the performance gain? The system is a POWER9 machine. How does thrust::copy and the specific execution
policy work internally? How many of the 4 copy engines of each
device are actually used?
However, nvprof does not display the [CUDA memcpy DtoH] category
anymore and instead shows void thrust::cuda_cub::core [...]
__parallel_for::ParallelForAgent [...] which even shows an increase in Time (s). This does not make sense because, as I said, I observed
a consistent performance improvement (smaller total execution time)
when using thrust::device.
3. Is this just a nvprof + thrust-specific behaviour that causes profiling numbers not to correlate with acutal execution time? I
observed something similiar for cudaFree: It seems that cudaFree is
returning control to the host code pretty fast which results in small
execution time while nvprof shows much higher numbers because the
actual deallocation probably happens in lazy fashion.
The Thrust doc on the thrust::device states the following:
Raw pointers allocated by host APIs should not be mixed with a thrust::device algorithm invocation when the device backend is CUDA
To my understanding, this means that host-device copy with thrust::device execution policy is invalid, in the first place, unless the host memory is pinned.
We imply that your host allocation is not pinned, BUT: One possibility is that on POWER9 with NVLINK you may be lucky that any host-allocated memory is addressable from within the GPU. Thanks to that, host-device copy with thrust::device works, though it should not.
On a regular system, host memory is addressable from within a GPU only if this host memory is allocated with cudaMallocHost (pinned). So, the question is whether your POWER system has automagically upgraded all allocations to be pinned. Is the observed performance bonus due to the implicitly-pinned memory, or would you get an additional speedup, if allocations are also done with cudaMallocHost explicitly?
Another Thrust design-based evidence is that thrust::device policy has par.on(stream) support, while thrust::host does not. This is pretty much aligned with the fact that asynchronous host-device copies are only possible with the pinned memory.
A very simplified version of my code looks like:
do {
//reset loop variable b to 0/false
b = 0;
// execute kernel
kernel<<<...>>>(b);
// use the value of b for while condition
} while(b);
Boolean variable b can be set to true by any thread in kernel and it tells us whether we continue running our loop.
Using cudaMalloc, cudaMemset, and cudaMemcpy we can create/set/copy device memory to implement this. However I just found the existence of pinned memory. Using cudaMalloHost to allocate b and a call to cudaDeviceSynchronize right after the kernel gave quite a speed up (~50%) in a simple test program.
Is pinned memory the best option for this boolean variable b or is there a better option?
You haven't shown your initial code and the modified code therefore nobody can have any idea about the details of the improvement you are stating in your post.
The answer to your question varies depending on
The b is read and written or is only written inside the GPU kernel. Reads might need to fetch the actual value directly from the host side if b is not found in the cache resulting in latencies. On the other hand, the latency for writes can be covered if there are further operations that can keep the threads busy.
How frequent you modify the value. If you access the value frequently in your program, the GPU probably can keep the variable inside L2 avoiding host side accesses.
The frequency of memory operations between accesses to b. If there are many memory transactions between accesses to b, it is more probable that b in the cache is replaced with some other content. As a result, when accessed again, b could not be found in the cache and a time-consuming host-access is necessary.
In cases having b in the host side causes many host memory transactions, it is logical to keep it inside the GPU global memory and transfer it back at the end of each loop iteration. You can do it rather fast with an asynchronous copy in the same stream as kernel's and synchronize with the host right after.
All above items are for cache-enabled devices. If your device is pr-Fermi (CC<2.0), the story is different.
I have a queue with elements which needs to be processed. I want to process these elements in parallel. The will be some sections on each element which need to be synchronized. At any point in time there can be max num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q
process_element(e)
{
lock()
some synchronized area
// a matrix access performed here so a spin lock would do
unlock()
...
unsynchronized area
...
if( condition )
{
new_element = generate_new_element()
q.push(new_element) // synchonized access to queue
}
}
process_queue()
{
while( elements in q ) // algorithm is finished condition
{
e = get_elem_from_queue(q) // synchronized access to queue
process_element(e)
}
}
I can use
pthreads
openmp
intel thread building blocks
Top problems I have
Make sure that at any point in time I have max num_threads running threads
Lightweight synchronization methods to use on queue
My plan is to the intel tbb concurrent_queue for the queue container. But then, will I be able to use pthreads functions ( mutexes, conditions )? Let's assume this works ( it should ). Then, how can I use pthreads to have max num_threads at one point in time? I was thinking to create the threads once, and then, after one element is processes, to access the queue and get the next element. However it if more complicated because I have no guarantee that if there is not element in queue the algorithm is finished.
My question
Before I start implementing I'd like to know if there is an easy way to use intel tbb or pthreads to obtain the behaviour I want? More precisely processing elements from a queue in parallel
Note: I have tried to use tasks but with no success.
First off, pthreads gives you portability which is hard to walk away from. The following appear to be true from your question - let us know if these aren't true because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to a queue. pthreads_mutex_lock and pthreads_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthreads_mutexes for dequeueing elements
Termination is tricky - one way to do this is to add a TERMINATE element to the queue, which upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a named thread dequeue it after all the threads are done.
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.
TBB is compatible with other threading packages.
TBB also emphasizes scalability. So when you port over your program to from a dual core to a quad core you do not have to adjust your program. With data parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is also another runtime that provides good results.
www.cilkplus.org
Since pThreads is a low level theading library you have to decide how much control you need in your application because it does offer flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance costs.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works with an std::queue correctly without any user synchronization (of course you would still need to protect your matrix access inside process_element(). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue (the only caution is that the newly added work will be processed immediately, unlike putting in a queue which would postpone processing till after all "older" items). Also, you don't have to worry about termination: parallel_do will complete automatically as soon as all initial queue items and new items created on the fly are processed.
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.
i have a kernel launched several times, untill a solution is found. the solution will be found by at least one block.
therefore when a block finds the solution it should inform the cpu that the solution is found, so the cpu prints the solution provided by this block.
so what i am currently doing is the following:
__global__ kernel(int sol)
{
//do some computations
if(the block found a solution)
sol = blockId.x //atomically
}
now on every call to the kernel i copy sol back to the host memory and check its value. if its set to 3 for example, i know that blockid 3 found the solution so i now know where the index of the solution start, and copy the solution back to the host.
in this case, will using cudaHostAlloc be a better option? more over would copying the value of a single integer on every kernel call slows down my program?
Issuing a copy from GPU to CPU and then waiting for its completion will slow your program a bit. Note that if you choose to send 1 byte or 1KB, that won't make much of a difference. In this case bandwidth is not a problem, but latency.
But launching a kernel does consume some time as well. If the "meat" of your algorithm is in the kernel itself I wouldn't spend too much time on that single, small transfer.
Do note, if you choose to use the mapped memory, instead of using cudaMemcpy, you will need to explicitly put a cudaDeviceSynchronise (or cudaThreadSynchronise with older CUDA) barrier (as opposed to an implicit barrier at cudaMemcpy) before reading the status. Otherwise, your host code may go achead reading an old value stored in your pinned memory, before the kernel overwrites it.