Background:
I have a kernel called "buildlookuptable" which does some calculation and stores its result in an int array called "dense_id".
Creating the cl_mem object:
cl_mem dense_id = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);
Setting the kernel argument:
errWrapper("setKernel", clSetKernelArg(kernel_buildLookupTable, 5, sizeof(cl_mem), &dense_id));
dense_id is used in other kernels afterwards. Due to poor memory alignment I see a huge drop in performance.
The following kernel accesses dense_id like this:
result_tuples += (dense_id[bucket+1] - dense_id[bucket]);
Execution time: 66 ms
no compiler-based vectorization
However, if I change the line to:
result_tuples += (dense_id[bucket] - dense_id[bucket]);
Execution time: 2 ms
vectorized (4) by the compiler
Both kernels ran on a GeForce 660 Ti.
So if I remove the overlapping memory access, the speed increases greatly.
Thread N accesses memory N, with no overlap.
To achieve correct results, I would like to duplicate the cl_mem object dense_id. The line in the following kernel would then be:
result_tuples += (dense_id1[bucket+1] - dense_id2[bucket]);
where dense_id1 and dense_id2 are identical.
Another idea would be to shift the contents of dense_id1 by one element, so the kernel line would become:
result_tuples += (dense_id1[bucket] - dense_id2[bucket]);
As dense_id is a small memory object, I am confident I could improve execution time at the cost of some memory by copying it.
Question:
After the kernel "buildlookuptable" has executed, I would like to duplicate the result array dense_id on the device side.
The straightforward way would be to use clEnqueueReadBuffer on the host side to fetch dense_id, create a new cl_mem object, and push it back to the device.
Is there a way to duplicate dense_id after "buildlookuptable" has finished, without copying it back to the host?
If requested I can add more code here. I tried to include only the required parts, as I don't want to drown you in irrelevant code.
I tried the solution with the clEnqueueCopyBuffer command, which works as desired.
The solution to my problem is:
clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3, 1 * sizeof(int), 0, inCount1 * sizeof(int), 0, NULL, NULL);
Note that the source offset and the copy size are given in bytes, hence the sizeof(int) factors: the copy starts one element into the source buffer and must therefore be one element shorter than the buffer.
Without using another kernel, it is possible to duplicate a memory object on the device side only.
To do so, you must first create another cl_mem object on the host side:
cl_mem count_buffer3 = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1 + 1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);
As I had to wait for the copy to finish, I used
clFinish(command_queue);
to make the program wait for its completion.
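If you only need to wait for this one copy rather than drain the whole queue, the copy's own completion event can be used instead of clFinish; a minimal sketch, reusing the buffer names from above:
cl_event copy_done;
clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3,
                    1 * sizeof(int), 0, inCount1 * sizeof(int),
                    0, NULL, &copy_done);
clWaitForEvents(1, &copy_done); // blocks only until this copy has finished
clReleaseEvent(copy_done);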
As hinted by DarkZeros, the performance gain was 0, because the compiler optimized the line
result_tuples += (dense_id[bucket] - dense_id[bucket]);
to 0.
Thank you for your insights so far!
Related
I'm new to CUDA and I'm trying to analyse the performance of two GPUs (RTX 3090; 48GB vRAM) in parallel. The issue I face is that, for the simple block of code shown below, I would expect the block to complete in the same time regardless of the presence of the Device 2 code, as the copies run asynchronously on different streams.
// aHost, bHost, cHost, dHost are pinned memory. All arrays are of the same length.
for (int i = 0; i < 2; i++) {
    // ---------- Device 1 code -----------
    cudaSetDevice(0);
    cudaMemcpyAsync(aDest, aHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(bDest, bHost, N * sizeof(float), cudaMemcpyHostToDevice, stream1);
    // ---------- Device 2 code -----------
    cudaSetDevice(1);
    cudaMemcpyAsync(cDest, cHost, N * sizeof(float), cudaMemcpyHostToDevice, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
}
But alas, when I run the block, the Device 1 code alone takes 80 ms, and adding the Device 2 code adds another 20 ms, bringing the block's execution time to 100 ms. I profiled the code and observed the following:
Device 1 + Device 2 Concurrently (image)
When I run Device 1 alone though, I get the below profile:
Device 1 alone (image)
I can see that the initial HtoD transfer on Device 1 is extended in duration when I add Device 2, and I'm not sure why this is happening, because as far as I'm concerned these transfers run independently, on different GPUs.
I realise that I haven't created separate CPU threads to handle the separate devices, but I'm not sure if that would help. Could someone please help me understand why this elongation happens when I add the Device 2 code?
EDIT:
Tried profiling the code and expected the execution durations to be independent per GPU, although I realise cudaMemcpyAsync involves the host as well, and perhaps the addition of Device 2 puts more stress on the CPU, as it now has to handle additional transfers...?
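One way to test the separate-thread hypothesis from the question is to issue each device's copies from its own host thread, so the launch latency of one device's cudaMemcpyAsync calls cannot delay the other's. A sketch under the assumption that the streams and buffers are already set up (copyOnDevice is a hypothetical helper, not from the question):
#include <cuda_runtime.h>
#include <thread>

// Drive one device from a dedicated host thread; the current device is a
// per-thread setting in the CUDA runtime, so each thread sets its own.
void copyOnDevice(int dev, float* dst, const float* src, size_t n, cudaStream_t s) {
    cudaSetDevice(dev);
    cudaMemcpyAsync(dst, src, n * sizeof(float), cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);
}

// Usage: one thread per device, then join both.
// std::thread t0(copyOnDevice, 0, aDest, aHost, N, stream1);
// std::thread t1(copyOnDevice, 1, cDest, cHost, N, stream2);
// t0.join(); t1.join();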
For CUDA, I know kernels are executed asynchronously once the launch commands have been issued to the default stream (the null stream); how does this work in OpenCL? Sample code follows:
cl_context context;
cl_device_id device_id;
cl_int err;
...
cl_kernel kernel1;
cl_kernel kernel2;
cl_command_queue Q = clCreateCommandQueue(context, device_id, 0, &err);
...
size_t global_w_offset[3] = {0,0,0};
size_t global_w_size[3] = {16,16,1};
size_t local_w_size[3] = {16,16,1};
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
clFinish(Q);
Will kernel1 and kernel2 be executed asynchronously after the commands are enqueued? (i.e., will their executions overlap?)
Update
According to the OpenCL Reference, it seems that setting the property CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in clCreateCommandQueue can meet my need. But does out_of_order mean asynchronization?
Does out_of_order mean asynchronization
"Out of order" queue means kernels may execute in different order than they were queued (if their event/data dependencies allow it). They also may execute concurrently, but not necessarily.
Also, asynchronous execution means something else than execution overlap (that's called parallel execution or concurrency). Asynchronous execution means that kernel code on device executes independently of host code - which is always true in OpenCL.
The simple way to get concurrency (execution overlap) is by using >1 queues on the same device. This works even on implementations which don't have Out-of-order queue capability. It does not guarantee execution overlap (because OpenCL can be used on much more devices than CUDA, and on some devices you simply can't execute >1 kernel at a time), but in my experience with most GPUs you should get at least some overlap. You need to be careful about buffers used by kernels in separate queues, though.
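For the code in the question, the two-queue variant would look roughly like this (a sketch reusing the question's names):
cl_command_queue Q1 = clCreateCommandQueue(context, device_id, 0, &err);
cl_command_queue Q2 = clCreateCommandQueue(context, device_id, 0, &err);

err = clEnqueueNDRangeKernel(Q1, kernel1, 3, global_w_offset, global_w_size,
                             local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q2, kernel2, 3, global_w_offset, global_w_size,
                             local_w_size, 0, nullptr, nullptr);

clFlush(Q1);  // submit both batches before blocking on either queue
clFlush(Q2);
clFinish(Q1);
clFinish(Q2);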
In your current code:
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
kernel1 finishes first, and only then is kernel2 executed.
Using
clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
you can execute multiple different kernels concurrently, though this isn't guaranteed.
Beware though: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported by all OpenCL implementations. It also means that you have no guarantee that kernel1 will finish execution before kernel2 starts. If any objects that are output by kernel1 are required as input to kernel2, it may fail.
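If kernel2 does depend on kernel1's output, an event can express that ordering explicitly, even in an out-of-order queue; a minimal sketch:
cl_event k1_done;
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
                             local_w_size, 0, nullptr, &k1_done);
// kernel2 will not start until kernel1's completion event is signalled
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
                             local_w_size, 1, &k1_done, nullptr);
clReleaseEvent(k1_done);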
Also, multiple command queues can be created and enqueued with commands; the reason for their existence is that the problem you wish to solve might involve some, if not all, of the heterogeneous devices in the host. They can represent independent streams of computation, where no data is shared, or dependent streams, where each subsequent task depends on the previous one (and data is often shared). However, these command queues will execute on the device without synchronization, provided that no data is shared. If data is shared, then the programmer needs to ensure its synchronization using the synchronization commands in the OpenCL specification.
My OpenCL program doesn't always finish before further host (C++) code is executed. The OpenCL code is only executed up to a certain point (which appears to be random). The code is shortened a bit, so there may be a few things missing.
cl::Program::Sources sources;
string code = ResourceLoader::loadFile(filename);
sources.push_back({ code.c_str(),code.length() });
program = cl::Program(OpenCL::context, sources);
if (program.build({ OpenCL::default_device }) != CL_SUCCESS)
{
    exit(-1);
}
queue = CommandQueue(OpenCL::context, OpenCL::default_device);
kernel = Kernel(program, "main");
Buffer b(OpenCL::context, CL_MEM_READ_WRITE, size);
queue.enqueueWriteBuffer(b, CL_TRUE, 0, size, arg);
buffers.push_back(b);
kernel.setArg(0, this->buffers[0]);
vector<Event> wait{ Event() };
Version 1:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &wait[0]);
Version 2:
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, &wait, NULL);
Then, in both versions:
wait[0].wait();
queue.finish();
Version 1 simply does not wait for the OpenCL program to finish. Version 2 crashes the program (at queue.enqueueNDRangeKernel):
Exception thrown at 0x51D99D09 (nvopencl.dll) in foo.exe: 0xC0000005: Access violation reading location 0x0000002C.
How would one make the host wait for the GPU to finish here?
EDIT: queue.enqueueNDRangeKernel returns -1000, while it returns 0 for a rather small kernel.
Version 1 says to signal wait[0] when the kernel is finished - which is the right thing to do.
Version 2 asks your clEnqueueNDRangeKernel() to wait for the events in wait before it starts that kernel [which clearly won't work].
On its own, queue.finish() [or clFinish()] should be enough to ensure that your kernel has completed.
Since you haven't done clCreateUserEvent, and you haven't passed the event into anything else that initializes it, the second variant doesn't work.
It is rather bad that it crashes [it should return "invalid event" or some such - but presumably the driver you are using doesn't have a way to check that the event hasn't been initialized]. I'm reasonably sure the driver I work with will issue an error for this case - but I try to avoid getting it wrong...
I have no idea where -1000 comes from - it is not a standard error code (only the cl_khr_gl_sharing extension defines -1000, as CL_INVALID_GL_SHAREGROUP_REFERENCE_KHR), nor a reasonable return value from the CL C++ wrappers. Whether the kernel is small or large [and/or completes in a short or long time] shouldn't affect the return value from the enqueue, since all that SHOULD do is enqueue the work [with no guarantee that it starts until a queue.flush() or clFlush() is performed]. Waiting for it to finish should happen elsewhere.
I do most of my work via the raw OpenCL API, not the C++ wrappers, which is why I'm referring to what the raw calls do, rather than the C++ wrappers.
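For completeness, the working pattern with the C++ wrappers would be along these lines (a sketch, not the answerer's code, reusing the question's names):
cl::Event done;
queue.enqueueNDRangeKernel(kernel, NDRange(), range, NullRange, NULL, &done);
done.wait();      // block the host until this kernel has completed
// or, without an event at all:
queue.finish();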
I faced a similar problem with OpenCL, where some packets of a data stream were not processed by OpenCL.
I realized it only happens while the notebook is plugged into a docking station.
Maybe this helps someone.
(No clFlush or clFinish calls.)
I'm getting the OpenCL error CL_INVALID_WORK_GROUP_SIZE with a local work size of 512. The program works with lower powers of 2, so I'm assuming the cause is exceeding CL_DEVICE_MAX_WORK_GROUP_SIZE.
Is there a way to query OpenCL for that exact value?
You can query the device's maximum work-group size like this:
size_t maxWorkGroupSize;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
sizeof(size_t), &maxWorkGroupSize, NULL);
Note that a specific kernel might have a different (lower) maximum, which you can query like this:
size_t maxWorkGroupSize;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
sizeof(size_t), &maxWorkGroupSize, NULL);
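With both values in hand, one simple (illustrative) way to stay within the limits is to halve the requested size until it fits:
size_t localSize = 512;                 // desired work-group size
while (localSize > maxWorkGroupSize)    // maxWorkGroupSize from the queries above
    localSize /= 2;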
I have multiple kernels. To the first of them I send some entries, and the output I get from the first kernel is the input for the next. My chain of kernels repeats this pattern 8 times, until the last kernel gives me the final output I need.
This is an example of what I did:
cl::Kernel kernel1 = cl::Kernel(OPETCL::program, "forward");
// set the kernel arguments
kernel1.setArg(0, cl_rowCol);
kernel1.setArg(1, cl_data);
kernel1.setArg(2, cl_x);
kernel1.setArg(3, cl_b);
kernel1.setArg(4, sizeof(int), &largo);
// run the kernel
OPETCL::queue.enqueueNDRangeKernel(kernel1, cl::NullRange, global, local, NULL, &profilingApp);
/********************************/
/** run the symmetries of X   **/
/********************************/
cl::Kernel kernel2 = cl::Kernel(OPETCL::program, "forward_symmX");
// set the kernel arguments
kernel2.setArg(0, cl_rowCol);
kernel2.setArg(1, cl_data);
kernel2.setArg(2, cl_x);
kernel2.setArg(3, cl_b);
kernel2.setArg(4, cl_symmLOR_X);
kernel2.setArg(5, cl_symm_Xpixel);
kernel2.setArg(6, sizeof(int), &largo);
// run the kernel
OPETCL::queue.enqueueNDRangeKernel(kernel2, cl::NullRange, global, local, NULL, &profilingApp);
OPETCL::queue.finish();
OPETCL::queue.enqueueReadBuffer(cl_b, CL_TRUE, 0, sizeof(float) * lors, b, NULL, NULL);
In this case cl_b is the output I need.
My question is about the arguments I send to the kernels: they are the same for all kernels, except for one.
Is what I did to set the arguments correct?
Are the arguments kept on the device during the whole chain of kernel executions?
Since you are using the same queue and the same OpenCL context, this is OK, and your kernels can use the data (arguments) calculated by a previous kernel; the data will be kept on the device.
I suggest you use clFinish after each kernel execution to ensure the previous kernel has finished its calculation before the next one starts. Alternatively, you can use events to ensure that.
I think you get this behaviour for free, as long as you don't specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you create your command queue.
It looks like you're doing it correctly. In general, the process is:
1. create your buffer(s)
2. queue a buffer copy to the device
3. queue the kernel execution
4. repeat step 3 for as many kernels as you need to run, passing the buffer as the correct parameter; use setArg to change/add params. The buffer will still exist on the device, modified by the previous kernels
5. queue a copy of the buffer back to the host
If you do specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, you will have to use events to control the execution order of the kernels. This seems unnecessary for your example, though. A condensed sketch of these steps follows.
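Here is that sketch, with the C++ wrapper and the question's queue (buf, host, n and OPETCL::context are illustrative placeholders, not the asker's names):
cl::Buffer buf(OPETCL::context, CL_MEM_READ_WRITE, n * sizeof(float));        // 1. create
OPETCL::queue.enqueueWriteBuffer(buf, CL_TRUE, 0, n * sizeof(float), host);   // 2. copy in

kernel1.setArg(0, buf);
OPETCL::queue.enqueueNDRangeKernel(kernel1, cl::NullRange, global, local);    // 3. run

kernel2.setArg(0, buf);   // 4. same buffer: still on the device, already updated
OPETCL::queue.enqueueNDRangeKernel(kernel2, cl::NullRange, global, local);

OPETCL::queue.enqueueReadBuffer(buf, CL_TRUE, 0, n * sizeof(float), host);    // 5. copy out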