Send the same data to multiple kernels in OpenCL - C++

I have multiple kernels. I send some inputs to the first one, and the output of each kernel is the input for the next. My queue of kernels repeats this behavior 8 times, until the last kernel produces the real output that I need.
This is an example of what I did:
cl::Kernel kernel1 = cl::Kernel(OPETCL::program, "forward");
// set the kernel arguments
kernel1.setArg(0, cl_rowCol);
kernel1.setArg(1, cl_data);
kernel1.setArg(2, cl_x);
kernel1.setArg(3, cl_b);
kernel1.setArg(4, sizeof(int), &largo);
// run the kernel
OPETCL::queue.enqueueNDRangeKernel(kernel1, cl::NullRange, global, local, NULL, &profilingApp);
/********************************/
/**    run the X symmetries    **/
/********************************/
cl::Kernel kernel2 = cl::Kernel(OPETCL::program, "forward_symmX");
// set the kernel arguments
kernel2.setArg(0, cl_rowCol);
kernel2.setArg(1, cl_data);
kernel2.setArg(2, cl_x);
kernel2.setArg(3, cl_b);
kernel2.setArg(4, cl_symmLOR_X);
kernel2.setArg(5, cl_symm_Xpixel);
kernel2.setArg(6, sizeof(int), &largo);
// run the kernel
OPETCL::queue.enqueueNDRangeKernel(kernel2, cl::NullRange, global, local, NULL, &profilingApp);
OPETCL::queue.finish();
OPETCL::queue.enqueueReadBuffer(cl_b, CL_TRUE, 0, sizeof(float) * lors, b, NULL, NULL);
In this case cl_b is the output that I need.
My question: the arguments I send are the same for all kernels, with only one difference.
Is the way I set the arguments correct?
Are the arguments kept on the device during the execution of all the kernels?

Since you are using the same queue and OpenCL context, this is OK: your kernels can use the data (arguments) calculated by the previous kernel, and the data will be kept on the device.
I suggest using clFinish after each kernel execution to ensure that the previous kernel has finished its calculation before the next one starts. Alternatively, you can use events to ensure that.

I think you get this behaviour for free, as long as you don't specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you create your command queue.
It looks like you're doing it correctly. In general, this is the process:
1. create your buffer(s)
2. queue a buffer copy to the device
3. queue the kernel execution
4. repeat step 3 for as many kernels as you need to run, passing the buffer as the correct parameter. Use setArg to change or add parameters. The buffer will still exist on the device, as modified by the previous kernels
5. queue a copy of the buffer back to the host
If you do specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, you will have to use events to control the execution order of the kernels. This seems unnecessary for your example though.
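The steps above can be sketched with a toy model in standard C++. This is not the OpenCL API: InOrderQueue, DeviceBuffer and runPipeline are hypothetical names, and the two lambdas standing in for kernels are made up for illustration. It only shows the ordering guarantee of an in-order queue, where each command sees the buffer exactly as the previous command left it and the host reads it back once at the end.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device-side buffer such as cl_b.
using DeviceBuffer = std::vector<float>;

// Toy in-order queue: commands run strictly in the order they were enqueued.
class InOrderQueue {
public:
    void enqueue(std::function<void()> command) { pending_.push(std::move(command)); }

    // Plays the role of finish(): drains every pending command in FIFO order.
    void finish() {
        while (!pending_.empty()) {
            pending_.front()();
            pending_.pop();
        }
    }

private:
    std::queue<std::function<void()>> pending_;
};

// Runs two dependent "kernels" on one buffer and reads it back once.
std::vector<float> runPipeline(const std::vector<float>& hostInput) {
    InOrderQueue queue;
    DeviceBuffer deviceB;            // stays "on the device" for the whole pipeline
    std::vector<float> hostResult;

    queue.enqueue([&] { deviceB = hostInput; });                 // copy to the device
    queue.enqueue([&] { for (float& v : deviceB) v += 1.0f; });  // first kernel
    queue.enqueue([&] { for (float& v : deviceB) v *= 2.0f; });  // second kernel reads the first kernel's output
    queue.enqueue([&] { hostResult = deviceB; });                // copy back to the host
    queue.finish();

    assert(hostResult.size() == hostInput.size());  // the read-back has happened
    return hostResult;
}
```

Calling runPipeline({1.0f, 2.0f}) applies both stages in order to the same buffer, with no host round trip between them.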

Related

What does this sentence in the OpenMPI documentation mean?

In the documentation of OpenMPI one can read the following sentence in the "When communicator is an Inter-communicator"-section:
The send buffer argument of the processes in the first group must be consistent with the receive buffer argument of the root process in the second group.
This section only appears in the documentation of non-blocking functions. In my case this is MPI_Igatherv.
I have an Inter-communicator connecting two groups. The first group contains only one process, which is a master (distributing and collecting data). The second group contains one or more worker processes (receiving data, doing work and sending results back). All the workers have the same code and the master has its own separate code. The master starts the workers with MPI_Comm_spawn.
However, I am concerned that I am not using the function call correctly.
As the master tries to receive data, I use the following code:
MPI_Igatherv(nullptr, 0, MPI_DOUBLE, recv_buf, sizes, offsets, MPI_DOUBLE, MPI_ROOT, inter_comm, &mpi_request);
The master does not contribute any data, so the send buffer here is a nullptr with zero size.
On the other hand, all workers send data like this:
MPI_Igatherv(send_buf, size, MPI_DOUBLE, nullptr, nullptr, nullptr, MPI_DOUBLE, 0, inter_comm, &mpi_request);
The workers do not receive any data, so the receive buffer is a nullptr with no sizes or offsets.
Is this the correct way?

Which way of synchronizing vkQueueSubmit() should I use?

I have a function that copies data from one buffer to another, and I need to synchronize its execution.
This is my current, poor option:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
//Start recording
vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
//Run command buffer
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
//Waiting for completion
vkQueueWaitIdle(graphicsQueue);
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
}
This option is bad because if I want to execute the copyBuffer() function several times, then all the buffers will be copied strictly one at a time.
I want to use a fence for each function call so that multiple calls can run in parallel.
So far, this is the only solution I have:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
//Create fence
VkFenceCreateInfo fenceInfo{};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
fenceInfo.flags = VK_FENCE_CREATE_SIGNALED_BIT;
VkFence executionCompleteFence = VK_NULL_HANDLE;
if (vkCreateFence(logicalDevice, &fenceInfo, VK_NULL_HANDLE, &executionCompleteFence) != VK_SUCCESS) {
throw MakeErrorInfo("Failed to create fence");
}
//Start recording
vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
//Run command buffer
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
vkWaitForFences(logicalDevice, 1, &executionCompleteFence, VK_TRUE, UINT64_MAX);
vkResetFences(logicalDevice, 1, &executionCompleteFence);
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
vkDestroyFence(logicalDevice, executionCompleteFence, VK_NULL_HANDLE);
}
Which of these options is better?
Is the second option written correctly?
Both functions are bad in the same way. They both block the CPU from doing anything until the transfer is done. And they both could be used to potentially submit multiple CBs to the same queue in the same frame, but with different submit commands.
Neither is desirable if performance is something you care about.
Ultimately, what you need to do is have your copyBuffer function not actually perform the copy. You should have a function which builds a command buffer to do a copy. That CB is then stored in a place to be submitted later with other copying CBs. Or better yet, you can have just one copying CB that each command adds to (the first one called in a frame will create the CB).
At some point in the future, before you've submitted the work that will use this data, you need to submit the transfer operations. And the way this works depends on whether you're submitting the transfer operations on the same queue as the work that will consume them.
If they're on the same queue, then all you need to do is have an event in a command buffer at the end of your batch that synchronizes the transfer operations with their receivers. If you want to be more clever, each transfer operation can have its own event, which the receiving operations will wait on.
And in same-queue transfers, you also want to make sure that you submit the transfers in the same vkQueueSubmit call as the rest of your work. Or to put it another way, you should never make more than one call to vkQueueSubmit for a particular queue in a particular frame.
If you're dealing with separate queues, then things change. A bit. If timeline semaphores aren't an option, you'll need to submit your transfer work before you submit the receiving operations. This is because the transfer batch will need to signal a semaphore that the receiving operation will wait on. And a binary semaphore cannot be waited on until the operation that signals it has been submitted to a queue.
But otherwise, everything else stays the same. Of course, you don't need events since you're synchronizing by semaphore.
The two functions are semantically identical and exhibit exactly the same blocking behavior.
The second is slightly better. vkQueueWaitIdle is more of a debugging feature and is not meant for hot paths. It might incur a hidden second submit to signal the implicit fence.
You don't need to reset a fence that you subsequently destroy anyway. And you are creating it pre-signaled, which is a bug. Also, you forgot to pass it to vkQueueSubmit.

Are OpenCL kernels executed asynchronously?

For CUDA, I know that kernels are executed asynchronously after the launch commands are issued to the default stream (null stream). How does this work in OpenCL? Sample code follows:
cl_context context;
cl_device_id device_id;
cl_int err;
...
cl_kernel kernel1;
cl_kernel kernel2;
cl_command_queue Q = clCreateCommandQueue(context, device_id, 0, &err);
...
size_t global_w_offset[3] = {0,0,0};
size_t global_w_size[3] = {16,16,1};
size_t local_w_size[3] = {16,16,1};
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
clFinish(Q);
Will kernel1 and kernel2 be executed asynchronously after the commands are enqueued (i.e. will their executions overlap)?
Update
According to the OpenCL Reference, it seems that setting the properties to CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in clCreateCommandQueue can meet my need. But does out_of_order mean asynchronization?
"Does out_of_order mean asynchronization?"
An "out of order" queue means kernels may execute in a different order than they were queued (if their event/data dependencies allow it). They also may execute concurrently, but not necessarily.
Also, asynchronous execution means something else than execution overlap (that's called parallel execution or concurrency). Asynchronous execution means that kernel code on device executes independently of host code - which is always true in OpenCL.
The simple way to get concurrency (execution overlap) is to use more than one queue on the same device. This works even on implementations which don't have out-of-order queue capability. It does not guarantee execution overlap (because OpenCL can be used on many more kinds of devices than CUDA, and on some devices you simply can't execute more than one kernel at a time), but in my experience with most GPUs you should get at least some overlap. You need to be careful about buffers used by kernels in separate queues, though.
In your current code:
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
kernel1 finishes first, and then kernel2 is executed.
Using
clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
you can execute multiple different kernels concurrently, though it isn't guaranteed.
Beware though, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported in all OpenCL implementations. It also means that you have no guarantee that kernel1 will finish execution before kernel2 starts. If any objects that are output by kernel1 are required as input in kernel2, it may fail.
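To illustrate how wait lists restore ordering on an out-of-order queue, here is a toy scheduler in standard C++. It is a sketch, not the OpenCL API: Command and runOutOfOrder are hypothetical names, and the policy of preferring the newest ready command is an assumption chosen only to make reordering visible. A real implementation may pick any order that the dependencies allow.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical command record: a name plus the indices of the commands it
// must wait for (the role an event wait list plays in OpenCL).
struct Command {
    std::string name;
    std::vector<std::size_t> waitList;
};

// Toy out-of-order scheduler. It deliberately prefers the most recently
// enqueued ready command, so only the wait lists constrain the order.
std::vector<std::string> runOutOfOrder(const std::vector<Command>& commands) {
    std::vector<bool> done(commands.size(), false);
    std::vector<std::string> executionOrder;

    while (executionOrder.size() < commands.size()) {
        bool progressed = false;
        for (std::size_t i = commands.size(); i-- > 0;) {
            if (done[i]) continue;
            bool ready = true;
            for (std::size_t dep : commands[i].waitList) {
                if (!done[dep]) { ready = false; break; }
            }
            if (ready) {
                done[i] = true;
                executionOrder.push_back(commands[i].name);
                progressed = true;
                break;  // rescan from the newest command
            }
        }
        assert(progressed && "wait lists must not form a cycle");
    }
    return executionOrder;
}
```

With an empty wait list on every command, this model runs kernel2 before kernel1. Once kernel2 lists kernel1 in its wait list, kernel1 always runs first, while unrelated commands remain free to move.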
Multiple command queues can also be created and filled with commands. They exist because the problem you wish to solve might involve some, if not all, of the heterogeneous devices in the host. They can represent independent streams of computation where no data is shared, or dependent streams of computation where each task depends on the previous one (often with shared data). Command queues execute on the device without synchronizing with each other, which is fine provided that no data is shared. If data is shared, the programmer needs to synchronize access to it using the synchronization commands in the OpenCL specification.

Wait for kernel to finish OpenCL

My OpenCL program doesn't always finish before further host (C++) code is executed. The OpenCL code is only executed up to a certain point (which appears to be random). The code is shortened a bit, so there may be a few things missing.
cl::Program::Sources sources;
string code = ResourceLoader::loadFile(filename);
sources.push_back({ code.c_str(),code.length() });
program = cl::Program(OpenCL::context, sources);
if (program.build({ OpenCL::default_device }) != CL_SUCCESS)
{
exit(-1);
}
queue = CommandQueue(OpenCL::context, OpenCL::default_device);
kernel = Kernel(program, "main");
Buffer b(OpenCL::context, CL_MEM_READ_WRITE, size);
queue.enqueueWriteBuffer(b, CL_TRUE, 0, size, arg);
buffers.push_back(b);
kernel.setArg(0, this->buffers[0]);
vector<Event> wait{ Event() };
Version 1:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &wait[0]);
Version 2:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, &wait, NULL);
.
wait[0].wait();
queue.finish();
Version 1 just does not wait for the OpenCL program. Version 2 crashes the program (at queue.enqueueNDRangeKernel):
Exception thrown at 0x51D99D09 (nvopencl.dll) in foo.exe: 0xC0000005: Access violation reading location 0x0000002C.
How would one make the host wait for the GPU to finish here?
EDIT: queue.enqueueNDRangeKernel returns -1000, whereas it returns 0 for a rather small kernel.
Version 1 says to signal wait[0] when the kernel is finished - which is the right thing to do.
Version 2 is asking your clEnqueueNDRangeKernel() to wait for the events in wait before it starts that kernel [which clearly won't work].
On its own, queue.finish() [or clFinish()] should be enough to ensure that your kernel has completed.
Since you haven't called clCreateUserEvent, and you haven't passed the event into anything else that initializes it, the second variant doesn't work.
It is rather bad that it crashes [it should return "invalid event" or some such - but presumably the driver you are using doesn't have a way to check that the event hasn't been initialized]. I'm reasonably sure the driver I work with will issue an error for this case - but I try to avoid getting it wrong...
I have no idea where -1000 comes from - it is neither a valid error code, nor a reasonable return value from the CL C++ wrappers. Whether the kernel is small or large [and/or completes in short or long time] shouldn't affect the return value from the enqueue, since all that SHOULD do is to enqueue the work [with no guarantee that it starts until a queue.flush() or clFlush is performed]. Waiting for it to finish should happen elsewhere.
I do most of my work via the raw OpenCL API, not the C++ wrappers, which is why I'm referring to what they do, rather than the C++ wrappers.
I faced a similar problem with OpenCL, where some packets of a data stream were not processed by OpenCL.
I realized that it only happens while the notebook is plugged into a docking station.
Maybe this helps someone.
(No clFlush or clFinish calls)

OpenCL: duplicate a memory object on the device

Background:
I have a kernel called "buildlookuptable" which does some calculation and stores its result in an int array called "dense_id".
creating cl_mem object:
cl_mem dense_id = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);
Setting the kernel argument:
errWrapper("setKernel", clSetKernelArg(kernel_buildLookupTable, 5, sizeof(cl_mem), &dense_ids));
dense_ids is used in other kernels afterwards. Due to terrible memory alignment I have a huge drop in performance.
The following kernel accesses dense_id like this:
result_tuples += (dense_id[bucket+1] - dense_id[bucket]);
Execution time: 66ms
no compiler based vectorization
However if i change the line into:
result_tuples += (dense_id[bucket] - dense_id[bucket]);
Execution time: 2ms
vectorized(4) by compiler
Both kernels ran on a GeForce 660 Ti.
So if I remove the overlapping memory access, the speed greatly increases.
Thread N accesses memory N, no overlapping.
In order to achieve correct results I would like to duplicate the cl_mem object dense_id. So the line in the following kernel would be:
result_tuples += (dense_id1[bucket+1] - dense_id2[bucket]);
where dense_id1 and dense_id2 are identical.
Another idea would be to shift the contents of dense_id1 by one element.
So the kernel line would be:
result_tuples += (dense_id1[bucket] - dense_id2[bucket]);
As dense_id is a small memory object, I am sure I could improve my execution time by copying it, at the cost of memory.
Question:
After the kernel execution of "buildlookuptable" I would like to duplicate the result array dense_id on the device side.
The straightforward way would be to use clEnqueueReadBuffer on the host side to fetch dense_id, create a new cl_mem object and push it back to the device.
Is there a way to duplicate dense_id after "buildlookuptable" finished, without copying it to the host again?
If requested I can add more code here. I tried to include only the required parts, as I don't want to drown you in irrelevant code.
I tried the solution with the clEnqueueCopyBuffer command, which works as desired.
The solution to my problem is:
clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3, 1, 0, (inCount1 + 1) * sizeof(int), NULL, NULL, NULL);
It is possible to duplicate a memory object on the device side only, without using another kernel.
In order to do so, you must first create another cl_mem object on the host side:
cl_mem count_buffer3 = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1 + 1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);
As I had to wait for the copy to finish, I used
clFinish(command_queue);
to make the program wait for its completion.
As hinted by DarkZeros the performance gain was 0, because the compiler optimized the line
result_tuples += (dense_id[bucket] - dense_id[bucket]);
to 0.
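A minimal host-side sketch in standard C++ shows why that benchmark variant measured nothing. The function names are hypothetical, and the loops over all buckets are an assumption made for illustration, since the original kernel's work distribution is not shown. In the second function every term is x - x, so a compiler is free to drop the memory reads entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The real computation: differences between neighbouring entries.
// Each term needs two different elements, so both loads must happen.
int sumAdjacentDifferences(const std::vector<int>& denseId) {
    int result = 0;
    for (std::size_t bucket = 0; bucket + 1 < denseId.size(); ++bucket) {
        result += denseId[bucket + 1] - denseId[bucket];
    }
    return result;
}

// The "fast" benchmark variant: every term subtracts an element from itself.
int sumSelfDifferences(const std::vector<int>& denseId) {
    int result = 0;
    for (std::size_t bucket = 0; bucket < denseId.size(); ++bucket) {
        const int term = denseId[bucket] - denseId[bucket];
        assert(term == 0);  // always true, so the loads can be optimized away
        result += term;
    }
    return result;
}
```

For any input the second function returns 0 without needing the data, so its 2ms timing says nothing about the memory access pattern of the first.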
Thank you for your insights so far!