I'm trying to run a vector addition application in which I need to launch multiple kernels concurrently. For concurrent kernel launch, someone on my last question advised me to use multiple command queues, which I am defining with an array:
context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err);
for (i = 0; i < num_ker; ++i)
{
    queue[i] = clCreateCommandQueue(context, device_id, 0, &err);
}
I'm getting the error "command terminated by signal 11" somewhere around the above code. I am also using a for loop for launching the kernels and enqueueing the data:
for (i = 0; i < num_ker; ++i)
{
    err = clEnqueueNDRangeKernel(queue[i], kernel, 1, NULL, &globalSize, &localSize,
                                 0, NULL, NULL);
}
The thing is, I am not sure where I am going wrong. I saw somewhere that we can make an array of command queues, so that is why I am using one.
One more piece of information: when I do not use a for loop and just define multiple command queues manually, it works fine.
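For completeness, the full pattern I am describing would look roughly like the sketch below: the queue array is explicitly sized to the number of queues, and every call is error checked. This is a sketch rather than my exact code; NUM_KER stands in for num_ker and the check() helper is made up for the example.

/* Sketch only: queue array sized to the queue count, with basic error checks. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_KER 4   /* stands in for num_ker */

static void check(cl_int err, const char *what)
{
    if (err != CL_SUCCESS) {
        fprintf(stderr, "%s failed with error %d\n", what, err);
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    cl_int err;
    cl_platform_id platform;
    cl_device_id device_id;
    cl_command_queue queue[NUM_KER];   /* array has room for all NUM_KER queues */
    int i;

    check(clGetPlatformIDs(1, &platform, NULL), "clGetPlatformIDs");
    check(clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device_id, NULL),
          "clGetDeviceIDs");

    cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &err);
    check(err, "clCreateContext");

    /* one queue per planned kernel launch, all on the same device and context */
    for (i = 0; i < NUM_KER; ++i) {
        queue[i] = clCreateCommandQueue(context, device_id, 0, &err);
        check(err, "clCreateCommandQueue");
    }

    /* ... create buffers, build the program, enqueue one kernel per queue ... */

    for (i = 0; i < NUM_KER; ++i)
        clReleaseCommandQueue(queue[i]);
    clReleaseContext(context);
    return 0;
}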
I read your last question as well, and I think you should first rethink what you really want to do and whether OpenCL is really the way to do it.
OpenCL is an API for massively parallel processing and data crunching. Each kernel (or queued task) operates on many data values in parallel at the same time, therefore outperforming any serial CPU processing by many orders of magnitude.
The typical use case for OpenCL is 1 kernel running millions of work items.
More advanced applications may need multiple sequences of different kernels, and special synchronization between CPU and GPU.
But concurrency is never a requirement. (Otherwise, a single-core CPU would not be able to perform the task, and that is never the case. It will be slower, OK, but it will still be possible to run it.)
Even if 2 tasks need to run at the same time, the time taken will be the same whether they run concurrently or not:
Not concurrent case:
Kernel 1: *
Kernel 2: -
GPU Core 1: *****-----
GPU Core 2: *****-----
GPU Core 3: *****-----
GPU Core 4: *****-----
Concurrent case:
Kernel 1: *
Kernel 2: -
GPU Core 1: **********
GPU Core 2: **********
GPU Core 3: ----------
GPU Core 4: ----------
In fact, the non-concurrent case is preferred, since at least the first task is already completed and further processing can continue.
What you do want to do, as far as I understand, is run multiple kernels at the same time, so that the kernels run fully concurrently. For example, run 100 kernels (the same kernel or different ones) and run them all at the same time.
That does not fit the OpenCL model at all, and in fact it may be far slower than a single CPU thread.
If each kernel is independent of all the others, a core (SIMD or CPU) can only be allocated to 1 kernel at a time (because it only has 1 program counter), even though it could run 1k threads at the same time. In the ideal scenario, this turns your OpenCL device into a pool of a few cores (6-10) that consume the queued kernels serially. And that is assuming the API and the device support it, which is not always the case. In the worst case you will have a single device that runs a single kernel and is 99% wasted.
Examples of things that can be done in OpenCL:
Data crunching/processing: multiplying vectors, simulating particles, etc.
Image processing: edge detection, filtering, etc.
Video compression, editing, generation
Ray tracing, complex lighting math, etc.
Sorting
Examples of things that are not suitable for OpenCL:
Attending to asynchronous requests (HTTP, traffic, interactive data)
Processing small amounts of data
Processing data that needs completely different handling for each item
From my point of view, the only real use case for multiple kernels is the last one, and no matter what you do, the performance will be horrible in that case.
Better to use a multithreaded pool instead.
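To make this concrete for your vector addition case: the natural fit is one kernel covering the whole vector, launched once with N work items. Below is a minimal sketch of that pattern; the vector size, the names, and the missing error checks are my assumptions for illustration, not your code.

/* Sketch: one kernel launch covering the whole vector. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src =
    "__kernel void vadd(__global const float *a, __global const float *b,\n"
    "                   __global float *c, const unsigned int n) {\n"
    "    size_t i = get_global_id(0);\n"
    "    if (i < n) c[i] = a[i] + b[i];\n"
    "}\n";

int main(void)
{
    enum { N = 1 << 20 };                 /* one million elements, one launch */
    size_t bytes = N * sizeof(float), global = N;
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes), *c = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_int err;
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", &err);
    cl_uint n = N;
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);
    clSetKernelArg(k, 3, sizeof(cl_uint), &n);

    /* one queue, one kernel, N work items */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

    printf("c[123] = %f\n", c[123]);
    clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    free(a); free(b); free(c);
    return 0;
}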
Related
Hi, I am a newbie learning Direct3D 12.
So far, I have understood that Direct3D 12 is designed for multithreading, and I'm trying to make my own simple multithreaded demo by following the tutorial by braynzarsoft.
https://www.braynzarsoft.net/viewtutorial/q16390-03-initializing-directx-12
The environment is Windows, using C++ and Visual Studio.
As far as I understand, multithreading in Direct3D 12 essentially comes down to populating command lists in multiple threads.
If that is right, it seems that
1 Swap Chain
1 Command Queue
N Command Lists (N corresponds to number of threads)
N Command Allocators (N corresponds to number of threads)
1 Fence
is enough for a single window program.
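To make the per-thread part concrete, here is a minimal sketch of what I mean by N command allocators and N command lists; ThreadCount is an arbitrary assumption and error checks (HRESULTs) are omitted.

// Sketch: per-thread command allocators and lists for a single ID3D12Device.
#include <d3d12.h>
#include <wrl/client.h>
#include <array>

using Microsoft::WRL::ComPtr;

constexpr int ThreadCount = 4;   // assumed thread count

struct PerThread {
    ComPtr<ID3D12CommandAllocator>    allocator;
    ComPtr<ID3D12GraphicsCommandList> list;
};

std::array<PerThread, ThreadCount> CreatePerThreadObjects(ID3D12Device* device)
{
    std::array<PerThread, ThreadCount> out;
    for (auto& t : out) {
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&t.allocator));
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  t.allocator.Get(), nullptr,
                                  IID_PPV_ARGS(&t.list));
        // Command lists are created in the recording state; close them until
        // the per-frame Reset() on the owning thread.
        t.list->Close();
    }
    return out;
}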
I wonder
Q1. When do we need multiple command queues?
Q2. Why do we need multiple fences?
Q3. When do we submit commands multiple times?
Q4. Does the GetCPUDescriptorHandleForHeapStart() return value change?
Q3 comes from here.
https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/GDC16/GDC16_gthomas_adunn_Practical_DX12.pdf
The purpose of Q4 is that I thought of calling the function once and storing the value for reuse; it didn't change when I debugged.
The rendering loop in my mind is (based on the Game Loop pattern), for example:
1. A thread waits for the fence value (e.g. the main thread).
2. Begin multiple threads to populate command lists.
3. Wait until all threads are done with population.
4. ExecuteCommandLists.
5. Swap chain Present.
6. Return to 1 in the next loop.
If I am totally misunderstanding, please help.
Q1. When do we need multiple command queues?
Read this https://learn.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization:
Asynchronous and low priority GPU work. This enables concurrent execution of low priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
High priority compute work. With background compute it is possible to interrupt 3D rendering to do a small amount of high priority compute work. The results of this work can be obtained early for additional processing on the CPU.
Background compute work. A separate low priority queue for compute workloads allows an application to utilize spare GPU cycles to perform background computation without negative impact on the primary rendering (or other) tasks.
Streaming and uploading data. A separate copy queue replaces the D3D11 concepts of initial data and updating resources. Although the application is responsible for more details in the Direct3D 12 model, this responsibility comes with power. The application can control how much system memory is devoted to buffering upload data. The app can choose when and how (CPU vs GPU, blocking vs non-blocking) to synchronize, and can track progress and control the amount of queued work.
Increased parallelism. Applications can use deeper queues for background workloads (e.g. video decode) when they have separate queues for foreground work.
Q2. Why do we need multiple fences?
All GPU work is asynchronous, so you can think of fences as low-level tools to achieve the same result as futures/coroutines. You can check whether the work has been completed, wait for the work to complete, or set an event on completion. You need a fence whenever you need to guarantee that a resource holds the output of work (when resource barriers are insufficient).
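For illustration, a minimal sketch of the wait-for-completion case; fence, fenceValue and fenceEvent are placeholders for whatever your renderer already owns.

// Sketch: waiting on a fence value, roughly what "wait for work to complete
// or set an event on completion" looks like in code. Error checks omitted.
#include <windows.h>
#include <d3d12.h>

void WaitForFence(ID3D12Fence* fence, UINT64 value, HANDLE event)
{
    if (fence->GetCompletedValue() < value) {        // work not finished yet
        fence->SetEventOnCompletion(value, event);   // event fires at `value`
        WaitForSingleObject(event, INFINITE);        // block this CPU thread
    }
}

// Typical use after submitting work:
//   queue->Signal(fence, ++fenceValue);   // GPU sets fenceValue when it gets there
//   WaitForFence(fence, fenceValue, fenceEvent);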
Q4. Does the GetCPUDescriptorHandleForHeapStart() return value change?
No, it doesn't.
"store the value for reuse, it didn't change when I debugged"
The Direct3D 12 samples do this; you should know them intimately if you want to become proficient.
"Rendering loop in my mind is (based on the Game Loop pattern), for example"
That sounds okay, but I urge you to look at the Direct3D 12 samples and steal the patterns (and the code) they use there.
#include <cstdio>

__global__ void loop(void) {
    int smid = -1;
    if (threadIdx.x == 0) {
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        printf("smid: %d\n", smid);
    }
    while (1);
}

int main() {
    loop<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
This is my source code. The kernel just prints the SM id when the thread index is 0 and then goes into an infinite loop, and the host just invokes the kernel and waits for it. I ran some experiments under 2 different configurations, as follows:
1. GPU (GeForce 940M), OS (Ubuntu 18.04), MPS (enabled), CUDA (v11.0)
2. GPU (GeForce RTX 3050 Ti Mobile), OS (Ubuntu 20.04), MPS (enabled), CUDA (v11.4)
Experiment 1: When I run this code under configuration 1, the GUI seems to freeze, because no graphical response can be observed anymore; but as soon as I press Ctrl+C, this phenomenon disappears as the CUDA process is killed.
Experiment 2: When I run this code under configuration 2, the system seems to work fine without any abnormal behaviour, and the smid output, such as smid: 2\n, is displayed.
Experiment 3: When I change the block configuration to loop<<<1, 1024>>> and run this new code twice under configuration 2, I get the same smid output, such as smid: 2\nsmid: 2\n. (For the GeForce RTX 3050 Ti Mobile, the number of SMs is 20, the maximum number of threads per multiprocessor is 1536, and the maximum number of threads per block is 1024.)
I'm confused by these results, and here are my questions:
1. Why doesn't the system output smid under configuration 1?
2. Why does the GUI system seem to freeze under configuration 1?
3. Unlike experiment 1, why does experiment 2 output smid normally?
4. In the third experiment, the block configuration reaches 1024 threads, which means that two different blocks cannot be scheduled onto the same SM. Under MPS, all CUDA contexts are merged into one CUDA context and share the GPU resources without time-slicing anymore, so why do I still get the same smid in the third experiment? (Furthermore, when I change the grid configuration to 10 and run it twice, the smid varies from 0 to 19 and each smid appears just once!)
Why doesn't the system output smid under configuration 1?
A safe rule of thumb is that, unlike host code, in-kernel printf output will not be printed to the console at the moment the statement is encountered, but at the point of completion of the kernel and device synchronization with the host. This is the regime in effect in configuration 1, which uses a Maxwell GPU. So no printf output is observed in configuration 1, because the kernel never ends.
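To illustrate the point, here is a sketch of a variant of your kernel where the loop terminates: the printf output then appears once cudaDeviceSynchronize() returns, consistent with the flush-at-synchronization behavior described above. (This is an illustrative modification, not your original experiment.)

// Sketch: same idea as the original kernel, but the loop is bounded, so the
// printf buffer is flushed when cudaDeviceSynchronize() completes.
#include <cstdio>

__global__ void finite_loop(void) {
    int smid = -1;
    if (threadIdx.x == 0) {
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
        printf("smid: %d\n", smid);
    }
    for (volatile int i = 0; i < 1000000; ++i) { }  // bounded busy-wait
}

int main() {
    finite_loop<<<1, 32>>>();
    cudaDeviceSynchronize();   // kernel ends -> printf output is flushed here
    return 0;
}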
Why does the GUI system seem to freeze under configuration 1?
For the purpose of this discussion, there are two possible regimes: a pre-Pascal regime in which compute preemption is not possible, and a Pascal-and-later regime in which it is. Your configuration 1 is a Maxwell device, which is pre-Pascal. Your configuration 2 is an Ampere device, which is post-Pascal. So in configuration 2, compute preemption is working. This has a variety of impacts, one of which is that the GPU will service both GUI needs and compute kernel needs "simultaneously" (the low-level behavior is not thoroughly documented, but it is a form of time-slicing, alternating attention between the compute kernel and the GUI). Therefore in configuration 1, pre-Pascal, a kernel running for any noticeable time at all will "freeze" the GUI during kernel execution. In configuration 2, the GPU services both, to some degree.
Unlike experiment 1, why does experiment 2 output smid normally?
Although it is not well documented, the compute preemption process appears to introduce an additional synchronization point, allowing the printf buffer to be flushed, as mentioned in point 1. If you read the documentation I linked there, you will see that "synchronization point" covers a number of possibilities, and compute preemption seems to introduce a new one.
Sorry, I won't be able to answer your 4th question at this time. A best practice on SO is to ask one question per question. However, I would consider the use of MPS with a GPU that is also servicing a display to be "unusual". Since we've established that compute preemption is in effect here, it may be that, due to compute preemption as well as the need to service a display, the GPU services clients in a round-robin, time-sliced fashion (since it must do so anyway to service the display). In that case the behavior under MPS may be different: compute preemption allows the usual limitations you are describing to be voided, and one kernel can completely replace another.
I've been writing some basic OpenCL programs and running them on a single device. I have been timing the performance of the code, to get an idea of how well each program is performing.
I have been looking at getting my kernels to run on the platform's GPU device and CPU device at the same time. The cl::Context constructor can be passed a std::vector of devices, to initialise a context with multiple devices. I have a system with a single GPU and a single CPU.
Is constructing a context with a vector of the available devices all you have to do for kernels to be distributed across multiple devices? I noticed a significant performance increase when I constructed the context with 2 devices, but it seems too simple.
There is a DeviceCommandQueue object; perhaps I should be using that to create a queue for each device explicitly?
I did some testing on my system. Indeed you can do something like this:
using namespace cl;
Context context({ devices[0], devices[1] });
CommandQueue queue(context); // queue to push commands to; note that no particular device is specified here, just the context
Program::Sources source;
std::string kernel_code = get_opencl_code();
source.push_back({ kernel_code.c_str(), kernel_code.length() });
Program program(context, source);
program.build("-cl-fast-relaxed-math -w");
I found that if the two devices are from different platforms (like one Nvidia GPU and one Intel GPU), either clCreateContext throws a read access violation at runtime or program.build fails at runtime. If, however, the two devices are from the same platform, the code compiles and runs, but it does not run on both devices. I tested with an Intel i7-8700K CPU and its integrated Intel UHD 630 GPU, and no matter the order of the devices in the vector the context is created with, the code was always executed on the CPU in this case. I checked with the Windows Task Manager and also with the results of kernel execution time measurements (execution times are characteristic of each device).
You could also monitor device usage with a tool like Task Manager to see which device is actually running. Let me know if it is any different on your system from what I observed.
Generally, parallelization across multiple devices is not done by handing the context a vector of devices; instead you give each device a dedicated context and queue and explicitly handle which kernels are executed on which queue. This gives you full control over memory transfers and execution order / synchronization points.
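For illustration, a sketch of that per-device approach using the same C++ wrapper as above; get_opencl_code() is assumed as in the earlier snippet and error handling is omitted.

// Sketch: one context + one queue per device, explicit control over which
// device runs which kernel.
#include <CL/cl.hpp>  // or the newer <CL/opencl.hpp>, depending on the wrapper version
#include <vector>
#include <string>

using namespace cl;

std::string get_opencl_code();   // assumed, as in the snippet above

struct DeviceSlot {
    Device       device;
    Context      context;
    CommandQueue queue;
    Program      program;
};

std::vector<DeviceSlot> make_slots(const std::vector<Device>& devices,
                                   const std::string& kernel_code)
{
    std::vector<DeviceSlot> slots;
    for (const Device& d : devices) {
        DeviceSlot s;
        s.device  = d;
        s.context = Context(d);                       // dedicated context
        s.queue   = CommandQueue(s.context, d);       // dedicated queue
        s.program = Program(s.context, kernel_code);
        s.program.build("-cl-fast-relaxed-math -w");  // build once per device
        slots.push_back(s);
    }
    return slots;
}

// Later: create one Kernel per slot, split the data, enqueue each part on its
// own slot.queue, and synchronize with queue.finish() before merging results.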
I was seeing a performance increase when passing in the vector of devices. I downloaded a CPU/GPU profiler to check the actual activity of my GPU and CPU while running the code, and it seemed that I was seeing activity on both devices. The CPU was registering around 95-100% activity, and the GPU was getting up to 30-40%, so OpenCL must be splitting the kernels between the 2 devices. My computer has a CPU with an integrated GPU, which may play a role in why the kernels are being shared across the devices, because it's not like it has a CPU and a completely separate GPU; they're connected on the same component.
New description of the problem:
I am currently running our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a double buffer. Every few seconds, this loop freezes for 200 ms. I did several tests, but none of them let me figure out what the software is waiting for. Since the software is rather complex, and the test environment could interfere with the software too, I need a tool/technique to find out what the recorder thread is waiting for while it is blocked for 200 ms. What tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors, and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used for writing in the data collection thread, and one buffer is used for reading the data and saving it to disc in the second thread. Once all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem: each time the buffers switch, the data collection thread blocks for about 200 ms, which is far too long. However, once in a while the switch is much faster, taking nearly no time at all. (Test PC: Windows 7 64-bit, i5-4570 CPU @ 3.2 GHz (4 cores), 16 GB DDR3 (800 MHz).)
My guess is that the performance problem is linked to the data being exchanged between cores. Only if the threads happen to run on the same core would the exchange be much faster. I thought about setting the thread affinity mask to force both threads onto the same core, but that also means I lose real parallelism. Another idea was to let the buffers collect more data before switching, but this dramatically reduces the update frequency of the data display, since it has to wait for the buffer switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers just means toggling this bool value. Of course, during the toggle, all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms is very reproducible for each switch event.
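For reference, a simplified sketch of this double buffer (not the actual code; Item is a placeholder for our data records and the ring-buffer bookkeeping is left out):

// Sketch of the double buffer described above: two vectors, a flag for the
// active write buffer, and a mutex guarding pushes and swaps.
#include <vector>
#include <mutex>
#include <cstdint>

struct Item { uint8_t bytes[100000]; };             // roughly 100000 bytes per item

class DoubleBuffer {
public:
    void push(const Item& item) {                   // called by the recorder thread
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_[active_].push_back(item);
    }

    std::vector<Item>& swap_and_get_read_buffer() { // called by the saving thread
        std::lock_guard<std::mutex> lock(mutex_);
        int read = active_;
        active_ = 1 - active_;                      // toggle the active flag
        return buffers_[read];
    }

private:
    std::vector<Item> buffers_[2];
    int               active_ = 0;                  // index of the write buffer
    std::mutex        mutex_;
};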
Locking and releasing a mutex just to switch one bool variable will not take 200 ms.
The main problem is probably that the two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically, it occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead of parallelism, you have two threads waiting for each other to finish their part of the work, with a similar effect to a single-threaded approach.
For further reading, I recommend this article, which describes lock contention in more detail.
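To illustrate the difference, here is a sketch of a contended pattern versus one that keeps the critical section minimal; the names and the commented-out write_to_disc() are placeholders, not your code.

// Sketch: the first function holds the lock during the slow work (contention);
// the second only holds it for the swap, so the recorder thread is barely blocked.
#include <mutex>
#include <vector>
#include <utility>

std::mutex m;
std::vector<char> writeBuf, readBuf;

void save_contended() {
    std::lock_guard<std::mutex> lock(m);       // lock held for the whole save
    // write_to_disc(writeBuf);                // recorder thread blocked all along
    writeBuf.clear();
}

void save_with_short_critical_section() {
    {
        std::lock_guard<std::mutex> lock(m);   // lock held only for the swap
        std::swap(writeBuf, readBuf);
    }
    // write_to_disc(readBuf);                 // slow work runs outside the lock
    readBuf.clear();
}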
Since you are running on Windows, maybe you use Visual Studio? If so, I would resort to the VS profiler, which is quite good (IMHO) in such cases, as long as you don't need to inspect data/instruction caches (then Intel's VTune is the natural choice). From my experience, VS is good enough to catch contention problems as well as CPU bottlenecks. You can run it directly from VS or as a standalone tool; you don't need VS installed on your test machine, you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
Then you can open the file and inspect it in Visual Studio. If you don't see anything keeping your CPU busy, switch to contention profiling: just change the "start" argument from "SAMPLE" to "CONCURRENCY".
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\; AFAIR it has been available since VS2010.
Good luck
After discussing the problem in the chat, it turned out that the Windows Performance Analyzer is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link in the chat: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx). The following steps revealed what the software had been waiting for:
1. Record information with WPR using the default settings and load the saved file in WPA.
2. Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be identified easily: since it saves data to disc, it is the one with file access. (Look at Memory -> Hard Faults.)
3. Check out Computation -> CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. It is best to display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait (µs) sum, Count...
4. Under ReadyingProcess, I looked for the process with the largest Wait (µs), since I expected this one to be responsible for the delays.
5. Under ReadyingThreadID, I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent waits of about 100 ms, which always showed up in pairs. In the ReadyingThreadID column, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our particular case, this led me to the assumption that the serial port IO commands could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch-data command sent via the virtual serial port pair sometimes gets lost. This might be linked to the very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received, and both the first and the second attempt to receive the data time out after their 100 ms timeout.
How do I programmatically find the maximum number of concurrent CUDA threads or streaming multiprocessors on a device / Nvidia graphics card? I know about warpSize, but there is no warpCount.
Most answers on the internet concern themselves with looking things up in PDFs.
Have you tried checking the SDK samples? I think this sample is the one you want:
Device Query
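For reference, the relevant part of what the Device Query sample reports can also be read directly with the CUDA runtime API; a minimal sketch (device 0 assumed):

// Sketch: query SM count and per-SM thread limits via the CUDA runtime.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    int maxResidentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;

    printf("SMs (multiprocessors):        %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:           %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:                    %d\n", prop.warpSize);
    printf("Max resident threads (total): %d\n", maxResidentThreads);
    return 0;
}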
This depends not only on the device but also on your code, e.g. things like the number of registers each thread uses or the amount of shared memory your block needs. I would suggest reading about occupancy.
Another thing I would note is that if your code relies on having a certain number of threads resident on the device (e.g. if you wait for several threads to reach some execution point), you are bound to face race conditions and see your code hang.
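To attach a sketch to the occupancy point above: the runtime can report how many blocks of a specific kernel fit on one SM, which is what bounds the number of resident threads in practice. my_kernel and blockSize below are placeholders.

// Sketch: occupancy depends on the kernel's resource usage, so ask the runtime
// about a specific kernel and block size.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) {           // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (data) data[i] += 1.0f;
}

int main() {
    const int blockSize = 256;                     // assumed block size
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel,
                                                  blockSize, 0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Resident blocks per SM for this kernel: %d\n", blocksPerSM);
    printf("Resident threads on the device:         %d\n",
           blocksPerSM * blockSize * prop.multiProcessorCount);
    return 0;
}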