What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items for parallel (as opposed to serial) execution.
I have written an OpenCL program conforming to the OpenCL API 1.2 specification, along with a simple OpenCL C kernel which simply squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, this should be enough work to force the scheduler to execute as many kernels and work items in parallel as possible, making use of all the available cores / processors.
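As a concrete sketch of that setup: the kernel below matches the "square the input integer" description, and the work-size arithmetic follows the rule above. The helper names are illustrative, and the device limits would really come from `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS` / `CL_DEVICE_MAX_WORK_GROUP_SIZE`; a CPU reference function is included because comparing device output against it is the first sanity check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// OpenCL C kernel as described: each work item squares one input element.
// (Held as a string so it can be passed to clCreateProgramWithSource.)
static const char* kKernelSrc = R"CLC(
__kernel void square(__global const int* in, __global int* out) {
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * in[gid];
}
)CLC";

// Work sizing as described above: local size = MAX_WORK_GROUP_SIZE,
// global size = k * (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE).
// The two device values are stand-ins for clGetDeviceInfo results.
std::size_t global_work_size(std::size_t max_compute_units,
                             std::size_t max_wg_size, std::size_t k) {
    return k * max_compute_units * max_wg_size;
}

// CPU reference implementation, for verifying whatever the device returns.
std::vector<int> square_reference(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * in[i];
    return out;
}
```

The global size is by construction an exact multiple of the local size, which is what keeps OpenCL from rejecting the NDRange.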
For a CPU, you can check cpuid, sched_getcpu, or GetCurrentProcessorNumber to find out which core / processor the current thread is currently executing on.
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built in function... or perhaps do the vendor's compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs, for core-usage monitoring, etc.? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code generated by the AMD and Intel compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M Series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are manuals for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Monitor, Intel's OpenCL compiler is a Clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out the program's buffer copies and visualisations, leaving only the kernel executions intact. Then put it in a tight loop and watch the heat rise. If it heats up like FurMark, it is using the cores. If it is not heating, you can also disable the serial operations in the kernels (the gid == 0 paths), then try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare against a known benchmark's temperature limits.
A similar approach works for the CPU. Using float4 heats it more than plain floats, which shows that even the instruction type matters for keeping all the ALUs busy (let alone the threads).
If the GPU has a really good cooler, you can watch its Vdroop instead. More load means more voltage drop: more cores in use means more droop, and heavier load per core also means more droop.
Whatever you do, it is up to the compiler and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware's complexity from the developer.
Using MSI Afterburner or similar software is not useful, because such tools show 100% usage even when you are using 1% of the card's true potential.
Or simply look at the temperature difference of the computer case between its starting state and equilibrium. If the delta-T is around 50 degrees with OpenCL and 5 without, OpenCL is parallelising something, though you can't know how much.
Related
I need to boost cv::dft performance in a multithreaded environment. I've done a simple test on Windows 10 on a Core i5 Intel processor:
Here I see that the CPU is not fully loaded (50% usage only). The individual threads are loaded equally, and also run far from 100%. Why is that, and how can I fix it? Can DFT be easily parallelized? Is such a parallel version implemented in the OpenCV library? Are there special build flags to enable it (and which)?
UPDATE: Running this code on Linux gives a slightly different result, but also below 100% utilization:
First of all, the behavior of cv::dft depends on OpenCV build flags; for example, if you set WITH_IPP, it will use the Intel Performance Primitives to speed up computation. FFT is memory-bound: if you simply launch more threads, you most probably won't benefit significantly from the parallelism, because the threads will be waiting on each other's memory accesses. I've observed this both on Linux and Windows. To gain more performance you should use FFTW3, which has a sophisticated algorithm for multi-threaded mode (it should be ./configure-d with a special flag). I observed up to a 7x speedup with 8 threads. But FFTW imposes the GNU license on your software unless you buy the paid, business-friendly license. I have not found any other open-source component that handles FFT parallelism in a smart way.
From https://software.intel.com/en-us/videos/purpose-of-the-mic-architecture I understand that applications with complex or numerous random memory accesses are not well suited for the Intel Xeon Phi. This is because the architecture uses 61 cores and 8 memory controllers: on an L1 or L2 cache miss, it takes up to hundreds of cycles to fetch the line from memory and get it ready for use by the CPU. Such applications are called latency-bound.
Then the tutorial mentions that many-core architectures (the Xeon Phi coprocessor in particular) are well suited for highly parallel homogeneous code. Two questions from there:
What is referred to as homogeneous code ?
What are real-world applications which can fully benefit from MIC architecture ?
I see the Intel MIC architecture as an "x86-based GPGPU", and if you are familiar with the concept of GPGPU you will find yourself at home with the Intel MIC.
A homogeneous cluster is a system infrastructure whose multiple execution units (i.e. CPUs) all have the same features. For example, a multicore system with four Intel Xeon processors is homogeneous.
A heterogeneous cluster is a system infrastructure with multiple execution units that have different features (e.g. a CPU and a GPU). For example, my Lenovo Z510 with its Intel i7 Haswell (4 CPU cores), its Nvidia GT740M (GPU) and its Intel HD Graphics 4600 (GPU) is a heterogeneous system.
An example of heterogeneous code could be a video game.
A video game has control code, executed on one core of one CPU, that controls what the other agents do: it sends shaders to execute on the GPUs, physics computations to be performed on other cores or GPUs, and so on.
In this example you need to write code that runs on the CPU (so it is "CPU aware") and code that runs on the GPU (so it is "GPU aware"). This is actually done by using different tools, different programming languages and different programming models!
Homogeneous code is code that doesn't need to be aware of N different programming models, one for each different kind of agent. It is just the same programming model, language and tool everywhere.
Take a look at this very simple sample code for the MPI library.
The code is all written in C, and it is the same program that just takes a different flow on each rank.
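The MPI sample itself is not reproduced here, but the "same program, different flow" idea can be sketched with plain threads standing in for MPI ranks: every worker runs the same function and merely branches on its rank id, which is the homogeneous SPMD pattern being described (the rank roles below are illustrative).

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Single Program, Multiple Data: every "rank" executes the same lambda,
// and the flow differs only by the rank id -- rank 0 takes the control
// role, the remaining ranks do the compute work.
std::vector<int> run_spmd(int nranks) {
    std::vector<int> results(nranks, 0);
    std::vector<std::thread> ranks;
    for (int r = 0; r < nranks; ++r)
        ranks.emplace_back([&results, r] {
            if (r == 0) results[r] = -1;       // "control" flow, like a root rank
            else        results[r] = r * r;    // "compute" flow
        });
    for (auto& t : ranks) t.join();
    return results;
}
```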
About the applications: well, that's really a broad question...
As said above I see the Intel MIC as a GPGPU based on x86 ISA (part of it at least).
A framework particularly useful for working with clustered systems (and listed in the video you linked) is OpenCL: it can be used for fast processing of images, for computer vision, and basically for anything that needs the same algorithm run billions of times with different inputs (like cryptography / brute-forcing applications).
If you search for some OpenCL based project on the web you will get an idea.
To answer your second question, it is better to ask ourselves "What could not take advantage of the MIC architecture?"; we will soon find that the further an algorithm is from the concept of stream processing and its related topics, including that of a kernel, the less suitable it is for the MIC.
First, a straightforward answer to your direct question: to get the most out of the coprocessor, your code should be able to use a large number of threads and should vectorize. How many threads? Well, you have 60 cores (give or take, depending on which version you get) and 4 threads per core, with a sweet spot around 2 threads per core for many codes. Sometimes you can get good performance even if you don't use every single core. But vectorization is extremely important; the long (512-bit) vectors are a big source of speed on the coprocessor.
Now, on to programming. The Intel Xeon Phi coprocessor uses two different kinds of programming - offload and native.
In the offload model, you write a program, determine which parts of that code have enough parallelism to make use of the large number of cores on the coprocessor and mark those sections with offload directives. Then inside those offloaded sections, you write the code using some form of parallelism, like OpenMP. (Heterogeneous)
In native code, you do not use any offload directives; instead, you compile with the -mmic compiler flag. Then you run the code directly on the coprocessor. The code you write will use some form of parallelism, like OpenMP, to make use of the large number of cores the coprocessor has. (Homogeneous)
Another variation on these programming models is to use MPI, often in addition to OpenMP. You can use the offload programming model, in which case the nodes in your MPI system will be the host nodes in your system. (Hybrid.) Alternately, you can use the native programming model, in which case you treat each coprocessor as just another node in your system. (Heterogeneous if both hosts and coprocessors are nodes; homogeneous if only coprocessors are used.)
You may have noticed that nothing I have said implies a separate programming style for the host and coprocessor. There are some optimizations you can make that will keep code written for the coprocessor from running on the processor as well but, in general, the code you write for the coprocessor can also be compiled for and run on the host by just changing the compiler options.
As far as real world apps, see https://software.intel.com/en-us/mic-developer/app-catalogs
Why is there no OpenCL (ocl) module in the OpenCV 3.0 beta?
I heard that the new OpenCV transparently uses OpenCL, but when I test this on Windows running on an Intel Core i5 (HD400 GPU), I cannot see any speed improvement as a result of running on the GPU.
Am I missing something here?
The ocl module of OpenCV was intentionally removed. Developers are no longer expected to use invocations like ocl::Canny; these methods will be invoked internally by OpenCV. Developers are expected to use the UMat structure, as explained in the presentation. UMat wraps cl_mem when OpenCL is available; otherwise it defaults to the CPU. See ocl.cpp.
Regarding speed, I would check the following:
In cvconfig.h in the build directory, check whether the OpenCL flag is ON or OFF.
In code, call ocl::setUseOpenCL(true).
In code, use UMat in place of Mat.
Then check the FPS with and without the call to ocl::setUseOpenCL(true).
What I would expect to see is not a drastic FPS increase. Even assuming the GPU is used, there may be cases where data has to be copied back and forth between CPU and GPU memory, which can hurt end performance. I would expect to see processing offloaded to the GPU and less burden on the CPU, not necessarily a speed increase.
With regard to tools, you could use AMD's CodeXL to observe the behavior of OpenCV/OpenCL. You can see the sequence of OpenCL API calls, the kernels used, their performance and source code, the data buffers and their contents, etc. Of course, all this only on AMD hardware. I think for NVIDIA, Parallel Nsight can do the same. For Intel, I do not know which tool can help.
I have two PCs:
a new high-end desktop PC, OpenCL compatible CPU and GPU, 32GB RAM
a very old laptop, Intel Celeron CPU, 512MB RAM, Ati M200 GPU
I am writing OpenCL/C++ software on my desktop PC. But when I travel somewhere, I continue the work on my old-school laptop. Programming C++ on this laptop is fine, but I can't try the OpenCL parts of my code. So at those times I am writing OpenCL code without knowing whether it is good or not.
Is there a way to virtualize an OpenCL-compatible CPU/GPU? I don't need high performance; I just want to try my code, and it doesn't matter if it is very slow (even slower than running it single-threaded on my Celeron CPU).
I guess, the answer is no.
(BTW, my plan is that there will be an option in my program to run it with or without OpenCL. This is also needed to measure performance and to compare OpenCL on the CPU/GPU against the CPU in single-threaded mode without OpenCL.)
Almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are provided by the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4's.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.
Recently I needed to do some experiments which require running multiple different kernels on AMD hardware. But I have several questions before I start coding, so I really need your help.
First, I am not quite sure whether AMD hardware supports concurrent kernel execution on one device. When I refer to the OpenCL specs, they say a command queue can be created as in-order or out-of-order, but I don't think "out-of-order" means "concurrent execution". Does anyone have info about this? My hardware is an AMD APU A8-3870K. If this processor does not support it, do any other AMD products?
Second, I know there is an extension, "device fission", which can be used to partition one device into two. This currently works only on CPUs. But in the OpenCL specs I saw something else, clCreateSubDevices, which is also used to partition one device into two. So my question is: is there any difference between these two techniques? My understanding is that device fission can only be used on a CPU, while clCreateSubDevices can be used on both the CPU and the GPU. Is that correct?
Thanks for any kind reply!
Truly concurrent kernels are not a needed feature and cause a lot of trouble for driver developers. As far as I know, AMD does not support this feature without the sub-device split. As you mentioned, "out-of-order" is not concurrent; it is just out-of-order execution of the queue.
But what is the point of running both of them in parallel at half speed instead of sequentially at full speed? You will probably lose overall performance if you do it that way.
I recommend you use more GPU devices (or GPU + CPU) if you run out of resources on one of the GPUs. Optimizing could be a good option too. But splitting is never a good option for a real scenario, only for academic purposes or testing.