Is it possible to control compute units with OpenCL? - concurrency

I couldn't find an answer to this in any documentation I've read about OpenCL so I'm asking: is it possible to control which compute unit executes which algorithm? I want to make one algorithm execute on compute unit 1 and another (different) algorithm execute on compute unit 2 concurrently. I want to be able to define on which compute unit to execute a kernel and possibly on how many processing elements/CUDA cores.
My GPU is an Nvidia GeForce GT 525M; it has 2 compute units with 48 CUDA cores each.

No, that's not possible. Nor would you want to. The GPU knows better than you how to schedule the work to make the most of the device; you should not (and cannot) micro-manage that. You can, of course, influence the scheduling by setting your global and local work-group sizes.
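For what it's worth, you can at least ask the device how many compute units it has and size your NDRange around that. A minimal sketch (illustrative only, assumes the first platform exposes a GPU, link with -lOpenCL):

    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, nullptr);                      // first platform
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);

        cl_uint cus = 0;
        size_t maxWg = 0;
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, nullptr);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWg), &maxWg, nullptr);
        printf("compute units: %u, max work-group size: %zu\n", cus, maxWg);
        // A GT 525M should report 2 compute units here. You can pick any local
        // size up to maxWg and a global size that is a multiple of it, but which
        // compute unit each work-group lands on is the scheduler's decision.
        return 0;
    }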
If you have two algorithms, A and B, and both are able to fully utilize the GPU, then there is no reason you should run them in parallel.
Sequentially:
CU 1: AAAAB
CU 2: AAAAB
In parallel:
CU 1: AAAAAAAA
CU 2: BB
Running them in parallel will actually make the total runtime longer unless A and B have exactly the same runtime: the sequential version takes runtime(A) + runtime(B) with each algorithm using the whole device, while the parallel version takes max(2 * runtime(A), 2 * runtime(B)), since each algorithm only gets half the device. In the picture above, A takes 4 time units and B takes 1 on the full device, so sequential is 5 units while parallel is max(8, 2) = 8.
If this doesn't help you, I suggest you ask a question detailing your actual use case: which two algorithms you have, what data you run them on, what their device utilization is, and why you want to run them in parallel.

Related

OpenCL: running parallel tasks on a data-parallel kernel

I'm currently reading up on the OpenCL framework for my thesis work, and what I've found so far is that you can run kernels either data-parallel or task-parallel. Now I have a question whose answer I can't manage to find.
Q: Say you have a vector that you want to sum up. In OpenCL you can do that by writing a data-parallel kernel and just running it. Fairly simple.
However, now say you have 10+ different vectors that also need to be summed up. Is it possible to run those 10+ vectors task-parallel, while still using a kernel that processes each of them as "data parallel"?
So you would basically be parallelizing tasks that are themselves run in parallel? Because what I've understood so far is that you can EITHER run tasks in parallel OR run one task itself in parallel.
The whole task-parallel/data-parallel distinction in OpenCL was a mistake. We deprecated clEnqueueTask in OpenCL 2.0 because it had no meaning.
All enqueued entities in OpenCL can be viewed as tasks. Those tasks may be run concurrently, they may be run in parallel, they may be serialized. You may need multiple queues to run them concurrently, or a single out-of-order queue, this is all implementation-defined to be fully flexible.
Those tasks may be data-parallel, if they are made up of multiple work-items working on different data elements within the same task. They may not be, if they consist of only one work-item. This last case is what clEnqueueTask used to provide - however, because it meant nothing more than clEnqueueNDRangeKernel with a global size of (1,1,1), and it was not checked against anything in the kernel code, deprecating it was the safer option.
So yes, if you enqueue multiple NDRanges, you can have multiple tasks in parallel, each one of which is data-parallel.
You can also process all of those vectors at once inside one data-parallel kernel, if you are careful with the way you pass them in. One option is to launch a range of work-groups, where each work-group iterates through a single vector summing it (that might well be the fastest way on a CPU, for cache-prefetching reasons). You could instead have each work-item handle one element, using some more complex lookup to work out which vector it belongs to, but that would likely have higher overhead. Or you can just launch multiple kernels in parallel, one per vector, and let the runtime decide whether it can run them together.
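To make the "one data-parallel kernel, one work-group per vector" option concrete, here is a rough sketch (my own illustration, not code from the answer): the kernel name, the sizes, and the reduction scheme are assumptions, all vectors are assumed to share the same length and to be packed back to back in one buffer, and the first platform is assumed to expose a GPU. Link with -lOpenCL.

    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    static const char* kSrc = R"CLC(
    __kernel void sum_one_vector(__global const float* vecs,   // num_vectors * len, packed
                                 __global float* sums,
                                 const int len)
    {
        const int vec = get_group_id(0);            // one work-group per vector
        const int lid = get_local_id(0);
        const int lsz = get_local_size(0);
        __local float scratch[64];                  // must match the local size below

        float acc = 0.0f;
        for (int i = lid; i < len; i += lsz)        // strided pass over "our" vector
            acc += vecs[vec * len + i];
        scratch[lid] = acc;
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int s = lsz / 2; s > 0; s >>= 1) {     // tree reduction within the group
            if (lid < s) scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0) sums[vec] = scratch[0];
    }
    )CLC";

    int main() {
        const int num_vectors = 10, len = 1 << 16;
        std::vector<float> data(num_vectors * len, 1.0f);   // every sum should equal len
        std::vector<float> sums(num_vectors, 0.0f);

        cl_platform_id plat; clGetPlatformIDs(1, &plat, nullptr);
        cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);
        cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, nullptr);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
        clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "sum_one_vector", nullptr);

        cl_mem dVecs = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      data.size() * sizeof(float), data.data(), nullptr);
        cl_mem dSums = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                      sums.size() * sizeof(float), nullptr, nullptr);
        clSetKernelArg(k, 0, sizeof(dVecs), &dVecs);
        clSetKernelArg(k, 1, sizeof(dSums), &dSums);
        clSetKernelArg(k, 2, sizeof(len), &len);

        // One work-group (64 work-items) per vector: 10 groups in a single launch.
        size_t local = 64, global = local * num_vectors;
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, &local, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, dSums, CL_TRUE, 0, sums.size() * sizeof(float),
                            sums.data(), 0, nullptr, nullptr);
        printf("sum of vector 0 = %f (expected %d)\n", sums[0], len);

        clReleaseMemObject(dVecs); clReleaseMemObject(dSums);
        clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }

The other route mentioned above, one NDRange per vector on an out-of-order queue (or on several queues), needs no kernel changes: you simply enqueue one launch per vector and let the runtime decide whether they actually overlap.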
If your 10+ different vectors are close to the same size, it becomes a data parallel problem.
The task parallel nature of OpenCL is more suited for CPU implementations. GPUs are more suited for data parallel work. Some high-end GPUs can have a handful of kernels in-flight at once, but their real efficiency is in large data parallel jobs.

Is it possible to execute multiple instances of a CUDA program on a multi-GPU machine?

Background:
I have written a CUDA program that performs processing on sequences of symbols. The program processes all sequences in parallel, with the stipulation that all sequences are of the same length, so I'm sorting my data into groups, each consisting entirely of sequences of the same length. The program processes one group at a time.
Question:
I am running my code on a Linux machine with 4 GPUs and would like to utilize all 4 GPUs by running 4 instances of my program (1 per GPU). Is it possible to have the program select a GPU that isn't in use by another CUDA application to run on? I don't want to hardcode anything that would cause problems down the road when the program is run on different hardware with a greater or fewer number of GPUs.
The environment variable CUDA_VISIBLE_DEVICES is your friend.
I assume you have as many terminals open as you have GPUs. Let's say your application is called myexe
Then in one terminal, you could do:
CUDA_VISIBLE_DEVICES="0" ./myexe
In the next terminal:
CUDA_VISIBLE_DEVICES="1" ./myexe
and so on.
Then the first instance will run on the first GPU enumerated by CUDA. The second instance will run on the second GPU (only), and so on.
Assuming bash, and for a given terminal session, you can make this "permanent" by exporting the variable:
export CUDA_VISIBLE_DEVICES="2"
thereafter, all CUDA applications run in that session will observe only the third enumerated GPU (enumeration starts at 0), and they will observe that GPU as if it were device 0 in their session.
This means you don't have to make any changes to your application for this method, assuming your app uses the default GPU or GPU 0.
You can also extend this to make multiple GPUs available, for example:
export CUDA_VISIBLE_DEVICES="2,4"
means the GPUs that would ordinarily enumerate as 2 and 4 would now be the only GPUs "visible" in that session and they would enumerate as 0 and 1.
In my opinion the above approach is the easiest. Selecting a GPU that "isn't in use" is problematic because:
We need a definition of "in use".
A GPU that was in use at a particular instant may not be in use immediately afterwards.
Most importantly, a GPU that is not "in use" could become "in use" asynchronously, meaning you are exposed to race conditions.
So the best advice (IMO) is to manage the GPUs explicitly. Otherwise you need some form of job scheduler (outside the scope of this question, IMO) to be able to query unused GPUs and "reserve" one before another app tries to do so, in an orderly fashion.
There is a better (more automatic) way, which we use in PIConGPU, which runs on huge (and varied) clusters.
See the implementation here: https://github.com/ComputationalRadiationPhysics/picongpu/blob/909b55ee24a7dcfae8824a22b25c5aef6bd098de/src/libPMacc/include/Environment.hpp#L169
Basically: call cudaGetDeviceCount to get the number of GPUs, iterate over them, call cudaSetDevice to make each one the current device in turn, and check whether that worked. The check should involve test-creating a stream, because of a CUDA bug where cudaSetDevice succeeded even though the device was actually in use and all later calls then failed.
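A rough sketch of that probe loop (not the PIConGPU code, just the shape of it; it assumes the GPUs are in exclusive compute mode so that a busy device makes context creation fail; compile with nvcc):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Returns the id of the first device we could actually grab, or -1 if none.
    int pickFreeDevice() {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

        for (int dev = 0; dev < count; ++dev) {
            if (cudaSetDevice(dev) != cudaSuccess) continue;   // cannot even select it

            // cudaSetDevice alone may "succeed" on a busy exclusive-mode GPU, so
            // force real context creation with a throw-away stream, as described above.
            cudaStream_t probe;
            if (cudaStreamCreate(&probe) == cudaSuccess) {
                cudaStreamDestroy(probe);
                return dev;                                    // this one is ours
            }
            cudaGetLastError();                 // clear the error, try the next device
        }
        return -1;
    }

    int main() {
        int dev = pickFreeDevice();
        if (dev < 0) { fprintf(stderr, "no free GPU found\n"); return 1; }
        printf("running on device %d\n", dev);
        // ... launch the real work on this device ...
        return 0;
    }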
Note: you may need to set the GPUs to exclusive compute mode so that each GPU can be used by only one process. If one "batch" doesn't have enough data, you may want the opposite: multiple processes submitting work to one GPU. So tune according to your needs.
Another idea: start an MPI application with as many processes per node as there are GPUs and use the local rank number as the device number. This would also help in applications like yours that have different datasets to distribute: for example, MPI rank 0 can process the length-1 data, MPI rank 1 the length-2 data, and so on.
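A bare-bones version of that MPI idea (assuming an MPI-3 library for the local-rank trick and one rank started per GPU; the names are illustrative; build with your MPI compiler wrapper plus the CUDA runtime):

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // Ranks that share a node form one "local" communicator (MPI-3).
        MPI_Comm local;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local);
        int localRank = 0;
        MPI_Comm_rank(local, &localRank);

        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        if (deviceCount == 0) { fprintf(stderr, "no CUDA devices\n"); MPI_Abort(MPI_COMM_WORLD, 1); }
        cudaSetDevice(localRank % deviceCount);   // first rank on the node -> GPU 0, etc.

        int worldRank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
        printf("world rank %d (local %d) -> GPU %d\n",
               worldRank, localRank, localRank % deviceCount);
        // Each rank can now work on its own group of sequences, e.g. one sequence
        // length per rank, as suggested above.

        MPI_Comm_free(&local);
        MPI_Finalize();
        return 0;
    }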

parallel processing library

I would like to know which parallel processing library would be best to use under these configurations:
A single quad-core machine. I would like to execute four instances of the same function, one on each core, each with different arguments.
A cluster of 4 machines, each with multiple cores. I would like to execute the same functions, but n-way parallel (4 machines × number of cores per machine). So I want it to scale.
Program details:
C++ program. There are no dependencies between the function calls. The same function is executed with different sets of inputs, more than 100 times in total.
There is no shared memory, as each call takes its own data and its own inputs.
No call needs to wait for the others to complete; there is no need for fork/join.
For the above scenarios, which parallel libraries would be best: MPI, Boost.MPI, OpenMP, or something else?
My preference would be Boost.MPI, but I'd like some recommendations. I am also not sure whether MPI can even be used on a single multi-core machine?
Thanks.
What you have here is an embarrassingly parallel problem (http://en.wikipedia.org/wiki/Embarrassingly_parallel). While MPI can definitely be used on a multi-core machine, it could be overkill for the problem at hand. If your tasks are completely separate, you could just compile them into separate executables, or a single executable with different inputs, and use "make -j [n]" (see http://www.gnu.org/software/make/manual/html_node/Parallel.html) to execute them in parallel.
If MPI comes naturally to you, by all means use it. OpenMP probably won't cut it if you want to spread the work across separate computers within a cluster.
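If you do go the MPI route, the embarrassingly parallel structure keeps the code very small. A sketch under stated assumptions: compute() and the task count below are placeholders for your real function and inputs, not anything from the question.

    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    double compute(double x) { return x * x; }       // placeholder for the real function

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int numTasks = 128;                    // the "> 100 times" from the question
        std::vector<double> myResults;
        for (int t = rank; t < numTasks; t += size)  // round-robin, no communication needed
            myResults.push_back(compute(double(t)));

        printf("rank %d finished %zu tasks\n", rank, myResults.size());
        MPI_Finalize();                              // gather the results here if you ever need them in one place
        return 0;
    }

The same binary then runs as "mpirun -np 4 ./prog" on the quad-core machine or across all 4 machines with a host file; nothing in the code has to change for it to scale.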

Concurrent kernel execution and OpenCL device partition

Recently I needed to run some experiments that require executing multiple different kernels on AMD hardware. I have several questions before I start coding, so I'd really appreciate your help.
First, I am not quite sure whether AMD hardware supports concurrent kernel execution on one device. The OpenCL spec says a command queue can be created as in-order or out-of-order, but I don't think "out-of-order" means "concurrent execution". Does anyone know more about this? My hardware is an AMD APU A8-3870K. If this processor does not support it, do any other AMD products?
Second, I know there is an extension, "device fission", which can be used to partition one device into two, and which currently works only on CPUs. But in the OpenCL spec I also saw clCreateSubDevices, which is likewise used to partition one device into two. Is there any difference between these two techniques? My understanding is that device fission can only be used on the CPU, while clCreateSubDevices can be used on both the CPU and the GPU. Is that correct?
Thanks for any kind reply!
Truly concurrent kernel execution is not a widely needed feature, and it causes a lot of trouble for driver developers. As far as I know, AMD does not support it without the sub-device split. As you mentioned, "out-of-order" is not concurrent; it is just out-of-order execution of the queue.
And what is the point of running both of them in parallel at half speed instead of sequentially at full speed? You will probably lose overall performance if you do it that way.
I recommend using more GPU devices (or GPU + CPU) if you run out of resources on one of the GPUs. Optimizing could be a good option too. But splitting is never a good option for a real scenario, only for academic purposes or testing.
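For completeness, if you do want to experiment with the split anyway (for testing or academic purposes), the core OpenCL 1.2 call looks roughly like this. A sketch only, assuming the first platform exposes a CPU device; whether partitioning is supported at all is reported through CL_DEVICE_PARTITION_PROPERTIES. Link with -lOpenCL.

    #define CL_TARGET_OPENCL_VERSION 120
    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_platform_id plat; clGetPlatformIDs(1, &plat, nullptr);
        cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_CPU, 1, &dev, nullptr);

        // Ask for sub-devices of 2 compute units each (CL_DEVICE_PARTITION_EQUALLY).
        const cl_device_partition_property props[] = { CL_DEVICE_PARTITION_EQUALLY, 2, 0 };

        cl_uint numSub = 0;
        cl_int err = clCreateSubDevices(dev, props, 0, nullptr, &numSub);   // query how many
        if (err != CL_SUCCESS) { printf("partitioning not supported (%d)\n", err); return 1; }

        std::vector<cl_device_id> subs(numSub);
        clCreateSubDevices(dev, props, numSub, subs.data(), nullptr);       // create them
        printf("created %u sub-devices\n", numSub);
        // Each sub-device can now get its own context and command queue, which is
        // essentially what the older cl_ext_device_fission extension provided.
        return 0;
    }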

Complex loop in a C++ program portable to OpenMP and MPI?

I have a C++ number crunching program. The structure is:
a) data input, data preparation
b) "big" loop, uses global and local data (lots of different variables in both cases)
c) postprocess results and write data
The most intensive part is b), which is basically a loop. I need to speed up the program on a cluster of 25 blades with 4 cores each. I wonder whether I can use OpenMP and MPI here, or whether you can point me to tutorials, not for general cases, but for complex and "big" for loops.
Thanks
Actually, you should use both.
Use MPI to distribute tasks between blades and OpenMP to fully utilize each blade. Take some time to understand how memory and sharing work in each case.
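As a shape to start from, the hybrid version of a big loop can be as small as this (a sketch only: bigLoopBody() and N are placeholders for your real loop body and trip count; compile with something like "mpic++ -fopenmp"):

    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    double bigLoopBody(long i) { return 1.0 / (1.0 + i); }   // placeholder work

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long N = 100000000;                    // the "big" loop
        const long chunk = (N + size - 1) / size;    // contiguous block per blade
        const long begin = rank * chunk;
        const long end   = (begin + chunk < N) ? begin + chunk : N;

        double local = 0.0;
    #pragma omp parallel for reduction(+:local)      // the 4 cores of each blade
        for (long i = begin; i < end; ++i)
            local += bigLoopBody(i);

        double total = 0.0;                          // combine the per-blade results
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }

The same pattern works whether the loop does a reduction (as here) or fills disjoint chunks of an array; the key point is that MPI owns the split across blades and OpenMP only parallelizes the local chunk.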
You cannot divide your task between blades using OpenMP alone. Try to divide your loop into several parts and distribute the work across them.
For example, if you want to combine two vectors of size N, the first N/2 elements can go to one node and the rest to another.
But the transmission cost between blades is noticeable, so if your task is not actually that big, it may be better to distribute it across the 4 cores of a single machine.