OpenCL: Running parallel tasks on a data-parallel kernel - C++

I'm currently reading up on the OpenCL framework for my thesis work. What I've come across so far is that you can run kernels either data parallel or task parallel. Now I've got a question and I can't manage to find the answer.
Q: Say that you have a vector that you want to sum up. You can do that in OpenCL by writing a kernel for a data-parallel process and just running it. Fairly simple.
However, now say that you have 10+ different vectors that also need to be summed. Is it possible to run these 10+ different vectors in task parallel, while still using a kernel that processes each of them as "data parallel"?
So you basically parallelize tasks which are themselves run in parallel? Because what I've come to understand is that you can EITHER run the tasks in parallel, OR run one task itself in parallel.

The whole task-parallel/data-parallel distinction in OpenCL was a mistake. We deprecated clEnqueueTask in OpenCL 2.0 because it had no meaning.
All enqueued entities in OpenCL can be viewed as tasks. Those tasks may be run concurrently, they may be run in parallel, they may be serialized. You may need multiple queues to run them concurrently, or a single out-of-order queue, this is all implementation-defined to be fully flexible.
Those tasks may be data-parallel, if they are made of multiple work-items working on different data elements within the same task. They may not be, consisting of only one work-item. This last definition is what clEnqueueTask used to provide - however, because it had no meaning whatsoever compared with clEnqueueNDRangeKernel with a global size of (1,1,1), and it was not checked against anything in the kernel code, deprecating it was the safer option.
So yes, if you enqueue multiple NDRanges, you can have multiple tasks in parallel, each one of which is data-parallel.
You can also copy all of those vectors at once inside one data-parallel kernel, if you are careful with the way you pass them in. One option would be to launch a range of work-groups, where each one iterates through a single vector, copying it (that might well be the fastest way on a CPU for cache-prefetching reasons). You could have each work-item copy one element, using some complex lookup to see which vector to copy from, but that would likely have high overhead. Or you can just launch multiple parallel kernels, one per vector, and have the runtime decide if it can run them together.
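As a rough illustration of that last option, the host side might look something like the sketch below. This is not a complete program: it assumes an OpenCL 2.x context, device and built program already exist, uses a hypothetical "sum_vector" reduction kernel and hypothetical per-vector buffers, and omits error checking.

    // Rough sketch (not a complete program): enqueue one data-parallel reduction
    // kernel per vector on an out-of-order queue. Assumes `context`, `device`, a
    // built `program` containing a hypothetical "sum_vector" kernel, per-vector
    // cl_mem buffers `inputs[i]` / `results[i]`, and a cl_uint `vectorLength`
    // already exist. Error checking omitted.
    #include <CL/cl.h>

    cl_int err = CL_SUCCESS;
    cl_queue_properties props[] = { CL_QUEUE_PROPERTIES,
                                    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0 };
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(context, device, props, &err);

    cl_kernel kernel = clCreateKernel(program, "sum_vector", &err);

    const size_t globalSize = 1024;   // work-items per vector (the data-parallel part)
    const size_t localSize  = 256;

    for (int i = 0; i < numVectors; ++i) {
        clSetKernelArg(kernel, 0, sizeof(cl_mem),  &inputs[i]);
        clSetKernelArg(kernel, 1, sizeof(cl_mem),  &results[i]);
        clSetKernelArg(kernel, 2, sizeof(cl_uint), &vectorLength);
        // Arguments are captured at enqueue time, so each enqueue is an
        // independent task; the out-of-order queue lets the runtime overlap
        // them if the hardware has spare resources.
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                               &globalSize, &localSize, 0, nullptr, nullptr);
    }
    clFinish(queue);

The same pattern works with one in-order queue per vector instead of a single out-of-order queue; whether the tasks actually overlap remains implementation-defined.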

If your 10+ different vectors are close to the same size, it becomes a data parallel problem.
The task parallel nature of OpenCL is more suited for CPU implementations. GPUs are more suited for data parallel work. Some high-end GPUs can have a handful of kernels in-flight at once, but their real efficiency is in large data parallel jobs.
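To make that concrete, here is a minimal kernel sketch (all names illustrative) that treats the whole batch as one data-parallel problem, assuming the vectors all have the same length and are packed contiguously into a single buffer; each work-group reduces one vector.

    // Illustrative OpenCL C kernel: work-group g sums vector g of a packed buffer.
    // `data` holds numVectors * vectorLength floats; `sums` receives one float per vector.
    __kernel void sum_vectors(__global const float *data,
                              __global float *sums,
                              const uint vectorLength,
                              __local float *scratch)
    {
        const uint vec   = get_group_id(0);   // which vector this work-group handles
        const uint lid   = get_local_id(0);
        const uint lsize = get_local_size(0);

        // Each work-item accumulates a strided partial sum of its vector.
        float acc = 0.0f;
        for (uint i = lid; i < vectorLength; i += lsize)
            acc += data[vec * vectorLength + i];
        scratch[lid] = acc;
        barrier(CLK_LOCAL_MEM_FENCE);

        // Standard tree reduction in local memory (assumes lsize is a power of two).
        for (uint s = lsize / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            sums[vec] = scratch[0];
    }

It would be launched with a global size of numVectors * workGroupSize, a local size of workGroupSize, and the __local argument sized to workGroupSize floats via clSetKernelArg with a NULL pointer.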


How to execute parallel compute shaders across multiple compute queues in Vulkan?

Update: This has been solved, you can find further details here: https://stackoverflow.com/a/64405505/1889253
A similar question was asked previously, but that question was initially focused on using multiple command buffers and triggering the submit across different threads to achieve parallel execution of shaders. Most of the answers suggest that the solution is to use multiple queues instead. The use of multiple queues also seems to be the consensus across various blog posts and Khronos forum answers. I have attempted those suggestions, running shader executions across multiple queues, but have not been able to see parallel execution, so I wanted to ask what I may be doing wrong. As suggested, this question includes runnable code of multiple compute shaders being submitted to multiple queues, which hopefully can be useful for other people looking to do the same (once this is resolved).
The current implementation is in this pull request / branch, however I will cover the main Vulkan specific points, to ensure only Vulkan knowledge is required to answer this question. It's also worth mentioning that the current use-case is specifically for compute queues and compute shaders, not graphics or transfer queues (although insights/experience achieving parallelism across those would still be very useful, and would most probably also lead to the answer).
More specifically, I have the following:
Multiple queues are first "fetched" - my device is an NVIDIA 1650, which supports 16 graphics+compute queues in queue family index 0, and 8 compute queues in queue family index 2
evalAsync performs the submission (which contains recorded shader commands) - You should notice that a fence is created which we'll be able to use. Also the submit doesn't have any waitStageMasks (PipelineStageFlags).
evalAwait allows us to wait for the fence - when calling evalAwait, we are able to wait for the submission to finish through the created fence
A couple of points that are not visible in the examples above but are important:
All evalAsync run on the same application, instance and device
Each evalAsync executes with its own separate commandBuffer and buffers, and in a separate queue
If you are wondering whether memory barriers could have something to do with it, we have tried removing all memoryBarriers completely (this one, for example, that runs before shader execution), but this has not made any difference in performance
The test that is used in the benchmark can be found here, however the only key things to understand are:
This is the shader that we use for testing; as you can see, we just add a bunch of atomicAdd steps to increase the amount of processing time
Currently the test has a small buffer size and a high number of shader loop iterations, but we also tested with a large buffer size (i.e. 100,000 instead of 10) and a smaller iteration count (1,000 instead of 100,000,000).
When running the test, we first run a set of "synchronous" shader executions on the same queue (the number is variable, but we've tested with 6-16, the latter being the max number of queues). Then we run these in an asynchronous manner, where we run all of them and then evalAwait until they are finished. When comparing the resulting times from both approaches, they take the same amount of time even though they run across different compute queues.
My questions are:
Am I currently missing something when fetching the queues?
Are there further parameters in the Vulkan setup that need to be configured to ensure asynchronous execution?
Are there any restrictions I may not be aware of, such as operating system processes only being able to submit GPU workloads to the GPU in a synchronous way?
Would multithreading be required in order for parallel execution to work properly when dealing with multiple queue submissions?
Furthermore, I have found several useful resources online across various Reddit posts and Khronos Group forums that provide very in-depth conceptual and theoretical overviews on the topic, but I haven't come across end-to-end code examples that show parallel execution of shaders. If there are any practical examples out there that you can share, which have functioning parallel execution of shaders, that would be very helpful.
If there are further details or questions that would help provide more context, please let me know - I'm happy to answer them and/or provide more detail.
For completeness, my tests were using:
Vulkan SDK 1.2
Windows 10
NVIDIA 1650
Other relevant links that have been shared in similar posts:
Similar discussion with a suggested link to an example, but the example seems to have disappeared...
Post on Leveraging asynchronous queues for concurrent execution (unfortunately no example code)
(Relatively old - 5 years) Post that suggests NVIDIA cards can't do parallel execution of shaders, but doesn't seem to have a conclusive answer
Nvidia presentation on Vulkan Multithreading with multiple queue execution (hence my question above on threads)
You are getting "asynchronous execution". You just don't expect it to behave the way it behaves.
On a CPU, if you have one thread active, then you're using one CPU core (or hyper-thread). All of that core's execution and computation capabilities are given to your thread alone (ignoring pre-emption). But at the same time, if there are other cores, your one thread cannot use any of the computational resources of those cores. Not unless you create another thread.
GPUs don't work that way. A queue is not like a CPU thread. It does not specifically relate to a particular quantity of computational resources. A queue is merely the interface through which commands get executed; the underlying hardware decides how to farm out commands to the various compute resources provided by the GPU as a whole.
What generally happens when you execute a command is that the hardware attempts to fully saturate the available shader execution units using your command. If there are more shader units available than the number of invocations your operation requires, then some resources are available immediately for the next command. But if not, then the entire GPU's compute resources will be dedicated to executing the first operation; the second one must wait for resources to become available before it can start.
It doesn't matter how many compute queues you shove work into; they're all going to try to use as many compute resources as possible. So they will largely execute in some particular order.
Queue priority systems exist, but these mainly help determine the order of execution for commands. That is, if a high-priority queue has some commands that need to be executed, then they will take priority the next time compute resources become available for a new command.
So submitting 3 dispatch batches on 3 separate queues is not going to complete faster than submitting 1 batch on one queue containing 3 dispatch operations.
The main reason multiple queues (of the same family) exist is to be able to submit work from multiple threads without having them do inter-thread synchronization (and to provide some possible prioritization of submissions).
I have been able to solve this using the above suggestion. To provide further context: I was trying to submit commands to multiple queues within the same family; however, as pointed out in the linked suggestion, NVIDIA (and other GPU vendors) have a varying range of capabilities when it comes to parallel processing of command submissions.
In my particular case, the NVIDIA 1650 card I was testing with only supports concurrent processing when workloads are submitted in different queueFamilies - more specifically, it is only able to support one concurrent command submission across one graphics queue and one compute family queue.
I re-implemented the code to allow for allocation of family queues for specific commands, and I was able to achieve parallel processing (with a 2x speed improvement by submitting across two queueFamilies).
Here is further detail on the implementation https://kompute.cc/overview/async-parallel.html
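For reference, the conceptual change looks roughly like the sketch below (illustrative names, hard-coded family indices, and not the actual Kompute code): the logical device is created with queues from two different families, and independent workloads are then submitted to each queue.

    // Illustrative sketch only (not a complete program): create the logical device
    // with one queue from the graphics+compute family and one from the compute-only
    // family, then retrieve both queues. The family indices (0 and 2 here) come from
    // vkGetPhysicalDeviceQueueFamilyProperties and will differ between devices;
    // `physicalDevice` is assumed to exist and error checking is omitted.
    #include <vulkan/vulkan.h>

    float priority = 1.0f;

    VkDeviceQueueCreateInfo queueInfos[2] = {};
    queueInfos[0].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[0].queueFamilyIndex = 0;    // graphics + compute family on this device
    queueInfos[0].queueCount       = 1;
    queueInfos[0].pQueuePriorities = &priority;
    queueInfos[1].sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfos[1].queueFamilyIndex = 2;    // compute-only family on this device
    queueInfos[1].queueCount       = 1;
    queueInfos[1].pQueuePriorities = &priority;

    VkDeviceCreateInfo deviceInfo = {};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 2;
    deviceInfo.pQueueCreateInfos    = queueInfos;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

    VkQueue graphicsComputeQueue = VK_NULL_HANDLE;
    VkQueue computeQueue         = VK_NULL_HANDLE;
    vkGetDeviceQueue(device, 0, 0, &graphicsComputeQueue);
    vkGetDeviceQueue(device, 2, 0, &computeQueue);
    // Command buffers submitted with vkQueueSubmit to these two queues can then
    // be processed concurrently on hardware that supports it.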

Ways to process data which is coming in at double my processing speed

I was asked this question by someone and am a bit confused about it.
Q: How will you process data which is coming in at double your processing speed?
I thought of the following:
Using a queue to handle this. But if I simply use a queue then the required queue size will be indefinitely large and I will still lag behind: every time interval t I will have accumulated half again as much data as I can process, so I will keep falling further and further behind.
Using one thread for reading data and two more for processing. But suppose my data has to be processed serially - then what happens?
I am still confused, and any help on similar problems will be welcome. I know there might be a standard solution for this, but I am unaware of it.
I would like to implement this in C/C++.
Short answer: you'll need some kind of parallel processing. It's not easy.
Long answer: Depending on your workload requirements, and whether the bottleneck is in IO or in CPU, it might simply be multithreading on a single core, or on a multicore processor, or on a shared-memory multiprocessor, or even distributed between multiple nodes. It can be just a matter of distributing and balancing your work between the worker units, if the problem is simple enough (embarrassingly parallel), or you'll need to do some explicit parallel programming. There are fundamentally two parallel programming models: OpenMP, for multithreading on multicore systems with shared memory (either symmetric or non-uniform access); and MPI, for distributed processing over a low-latency, high-bandwidth network. To complicate things further, OpenMP and MPI can perfectly well run together in a hybrid parallel programming environment: OpenMP distributes and coordinates the parallel compute load between the cores on each node, and MPI does it between the nodes. Be aware, it is very tough work.
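As a minimal illustration of the shared-memory (OpenMP) side, here is a sketch that spreads a batch of incoming items over all cores. It assumes the items can be processed independently; process_item is a placeholder for the real per-record work.

    // Sketch: drain a batch of incoming items using all cores with OpenMP.
    // Build with e.g.: g++ -O2 -fopenmp batch.cpp
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Placeholder for the real per-record work; assumed independent per item.
    double process_item(double item) { return std::sqrt(item) * 2.0; }

    int main()
    {
        std::vector<double> batch(1000000, 3.0);   // one "batch" of incoming data
        std::vector<double> results(batch.size());

        // Each iteration goes to some worker thread; with N cores this roughly
        // divides the per-batch processing time by N, which is what is needed
        // when data arrives twice as fast as one core can handle it.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < static_cast<long>(batch.size()); ++i)
            results[i] = process_item(batch[i]);

        std::printf("processed %zu items\n", results.size());
        return 0;
    }

If the data really must be processed serially, no amount of worker threads helps; in that case the only options are to make the serial step itself faster or to relax the ordering requirement.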

How is a parallel scan performed on an array with more elements than threads per block?

I have seen many implementations of parallel scan; the two main ones are Hillis & Steele and the Blelloch scan. However, all the implementations I have seen work within shared memory - memory only shared between threads in a block.
Are there any implementations of scan that work well over arrays that have more elements than threads per block, i.e. the array will not fit into shared memory?
This link mentions a scan implementation I see in all my searches, a Hillis & Steele version, example 39-1: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
Is the only option to do a segmented scan on sub arrays within the array and then do a "final scan" adding a magnitude value from the prior sub array to the next?
With or without shared memory, CUDA kernels execute in chunks (threadblocks) that can execute in any order. To take full advantage of the hardware, you must have multiple threadblocks in your kernel call, but this creates an uncertain execution order.
Because of this, a scan algorithm that works across a large array will necessarily have to work in threadblock-sized pieces (in some fashion). If we have multiple threadblocks, then a given threadblock has no way of knowing whether other threadblocks have finished their work on adjacent data. (Yes, there are contrived mechanisms to allow inter-threadblock communication, but these are fraught with difficulty and don't solve the problem on a large scale.)
The net effect of this is that algorithms like this generally imply a global sync of some sort, and the only safe-in-any-scenario global sync is the kernel launch. Threadblocks can do a portion of their work independently, but when it comes time to stitch the work of threadblocks together, we must wait until step A is completed across all threadblocks before proceeding with step B.
Therefore I think you'll find that most device-wide scan algorithms, including the chapter 39 GPU Gems example you linked, as well as thrust and cub will launch multiple kernels to get this job done, since the kernel launch gives a convenient global sync.
Note that we can certainly devise a scan that has individual threadblocks that "work on more elements than threads per block", but this does not ultimately solve our problem (unless we use only 1 threadblock), because we must launch multiple threadblocks to take full advantage of the hardware, and multiple threadblocks in the general case introduces the global sync necessity.
The cub and thrust implementations I mentioned are both open-source template libraries, so you can certainly study the code there if you wish (not a trivial undertaking). They do represent high-quality approaches designed and built by CUDA experts. You can also at a high level study their behavior quite easily using:
nvprof --print-gpu-trace ./mycode
to get a quick read on how many kernels are being launched and what data transfers may be occurring, or you can use nvvp, the visual profiler, to study this.
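For comparison, thrust hides that multi-kernel structure behind a single call; a minimal device-wide inclusive scan looks roughly like the sketch below, and profiling it with nvprof as above will show the kernels it launches under the hood.

    // Minimal device-wide inclusive scan with thrust; the input is far larger than
    // one threadblock, and thrust handles the block-wise scans plus the global
    // stitching passes internally. Build with: nvcc -O2 scan.cu
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdio>

    int main()
    {
        const int n = 1 << 24;                   // many more elements than threads per block
        thrust::device_vector<int> data(n, 1);   // all ones, so prefix sums are 1..n
        thrust::inclusive_scan(data.begin(), data.end(), data.begin());

        const int last = data.back();            // copies the final element back to the host
        std::printf("last prefix sum = %d (expected %d)\n", last, n);
        return 0;
    }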

Executing C++ program on multiple processor machine

I developed a program in C++ for research purposes. It takes several days to complete.
Now I am executing it on our lab's 8-core server machine to get results quickly, but I see that the machine assigns only one processor to my program and it remains at 13% processor usage (even though I set the process priority to high and the affinity to all 8 cores).
(It is a simple object oriented program without any parallelism or multi threading)
How can I get true benefit from the powerful server machine?
Thanks in advance.
Partition your code into chunks you can execute in parallel. You need to go read about data parallelism and task parallelism. Then you can use OpenMP or MPI to break up your program.
(It is a simple object oriented program without any parallelism or multi threading)
How can I get true benefit from the powerful server machine?
By using more threads. No matter how powerful the computer is, it cannot spread a thread across more than one processor. Find independent portions of your program and run them in parallel.
C++0x threads
Boost threads
OpenMP
I personally consider OpenMP a toy. You should probably go with one of the other two.
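For instance, a bare-bones C++0x/C++11 sketch that splits an embarrassingly parallel loop across hardware threads (purely illustrative; your program's work units will differ) could look like this:

    // Sketch: split an embarrassingly parallel loop across std::thread workers.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main()
    {
        const std::size_t n = 10000000;
        std::vector<double> data(n, 1.0);

        unsigned numThreads = std::thread::hardware_concurrency();
        if (numThreads == 0) numThreads = 8;             // fallback if the query fails
        const std::size_t chunk = (n + numThreads - 1) / numThreads;

        std::vector<std::thread> workers;
        for (unsigned t = 0; t < numThreads; ++t) {
            const std::size_t begin = t * chunk;
            const std::size_t end   = std::min(n, begin + chunk);
            // Each thread owns a disjoint slice of the data, so no locking is needed.
            workers.emplace_back([&data, begin, end] {
                for (std::size_t i = begin; i < end; ++i)
                    data[i] = data[i] * 2.0 + 1.0;
            });
        }
        for (auto &w : workers)
            w.join();

        std::printf("done, data[0] = %f\n", data[0]);
        return 0;
    }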
You have to exploit parallelism explicitly by splitting your code into multiple tasks that can be executed independently, and then either use thread primitives directly or a higher-level parallelization framework such as OpenMP.
If you don't want to make your program itself use multithreaded libraries or techniques, you might be able to try breaking your work up into several independent chunks. Then run multiple copies of your program, each assigned to a different chunk via different command-line parameters.
As for just generally improving a program's performance...there are profiling tools that can help you speed up or find the bottlenecks in memory usage, I/O, CPU:
https://stackoverflow.com/questions/tagged/c%2b%2b%20profiling
Profiling won't help split your work across cores, but if you can get an 8x speedup in an algorithm, that might help more than multithreading would on 8 cores. Just something else to consider.

Is concurrent programming the same as parallel programming?

Are they both the same thing? Looking just at what concurrent or parallel means in geometry, I'd definitely say no:
In geometry, two or more lines are said to be concurrent if they intersect at a single point.
and
Two lines in a plane that do not intersect or meet are called parallel lines.
Again, in programming, do they have the same meaning? If yes...Why?
Thanks
I agree that the geometry vocabulary is in conflict. Think of train tracks instead: Two trains which are on parallel tracks can run independently and simultaneously with little or no interaction. These trains run concurrently, in parallel.
The basic usage difficulty is that "concurrent" can mean "at the same time" (with the trains or code) or "at the same place" (with the geometric lines). For many practical purposes (trains, thread resources) these two notions are directly in conflict.
Natural language is supposed to be silly, ambiguous, and confusing. But we're programmers. We can take refuge in the clarity, simplicity, and elegance of our formal programming languages. Like perl.
From Wikipedia:
Concurrent computing is a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel.
Basically, programs can be written as concurrent programs if they are made up of smaller interacting processes. Parallel programming is actually doing these processes at the same time.
So I suppose that concurrent programming is really a style that lends itself to processes being executed in parallel to improve performance.
No, concurrent is definitely different from parallel. Here is exactly how.
Concurrency refers to the sharing of resources in the same time frame. As an example, several processes may share the same CPU or share memory or an I/O device.
Now, by definition, two processes are concurrent if and only if the second starts execution before the first has terminated (on the same CPU). If the two processes both run on the same - say, for now, single-core - CPU, the processes are concurrent but not parallel: in this case, parallelism is only virtual and refers to the OS doing timesharing. The OS seems to be executing several processes simultaneously. If there is only one single-core CPU, only one instruction from only one process can be executing at any particular time. Since the human time scale is billions of times slower than that of modern computers, the OS can rapidly switch between processes to give the appearance of several processes executing at the same time.
If you instead run the two processes on two different CPUs, the processes are parallel: there is no sharing in the same time frame, because each process runs on its own CPU. The parallelism in this case is not virtual but physical. It is worth noting here that running on different cores of the same multi-core CPU still cannot be classified as fully parallel, because the processes will share the same CPU caches and will even contend for them.