OpenCL and OpenCV 3.0 Beta - C++

Why is there no OpenCL (ocl) module in OpenCV 3.0 beta?
I heard that the new OpenCV uses OpenCL transparently, but when I test this on Windows on an Intel Core i5 (HD400 GPU), I cannot see any speed improvement from running on the GPU.
Am I missing something here?

The ocl module of OpenCV was intentionally removed. Developers are no longer expected to use invocations like ocl::Canny; these methods are invoked internally by OpenCV. Developers are expected to use the UMat structure, as explained in the presentation. UMat wraps cl_mem when OpenCL is available; otherwise it defaults to the CPU. See ocl.cpp.
Regarding speed, I would check the following:
In cvconfig.h in the build directory, check whether the OpenCL flag is ON or OFF.
In code, call ocl::setUseOpenCL(true).
In code, use UMat in place of Mat.
Then compare FPS with and without the call to ocl::setUseOpenCL(true);
What I would expect is not a drastic FPS increase. Even assuming the GPU is used, there can be cases where data has to be copied back and forth between CPU and GPU memory, and this will affect end-to-end performance. What I would expect to see is processing being offloaded to the GPU and less load on the CPU, not necessarily a speed increase.
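For reference, a minimal sketch of such a comparison (the file name and iteration count are made up, error handling is omitted):

    // Compare a UMat-based pipeline with and without the OpenCL dispatch enabled.
    #include <opencv2/core.hpp>
    #include <opencv2/core/ocl.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <iostream>

    int main()
    {
        cv::ocl::setUseOpenCL(true);   // set to false to get the CPU baseline
        std::cout << "OpenCL in use: " << cv::ocl::useOpenCL() << std::endl;

        cv::Mat frame = cv::imread("input.png", cv::IMREAD_GRAYSCALE);
        cv::UMat src, edges;
        frame.copyTo(src);             // from here on, work on UMat

        double t0 = (double)cv::getTickCount();
        for (int i = 0; i < 100; ++i)
            cv::Canny(src, edges, 50.0, 150.0);   // dispatched to an OpenCL kernel when available
        double seconds = ((double)cv::getTickCount() - t0) / cv::getTickFrequency();
        std::cout << "100 iterations: " << seconds << " s" << std::endl;
        return 0;
    }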

With regard to tools, you could use AMD's CodeXL to observe the behavior of OpenCV/OpenCL. You can see the sequence of OpenCL API calls, the kernels used, their performance and source code, data buffers and their contents, etc. Of course, all this only works on AMD hardware. I think for NVIDIA, Parallel Nsight can do the same. For Intel, I do not know which tool can help.

Related

OpenCV acceleration with CUDA in C++

I am an HPC student and I have a project coded with OpenCV functions and C++. I have to parallelize the code for high performance, so I decided to use CUDA acceleration. I am confused about the following:
For getting high performance, is it enough to use only CUDA?
Can I use both OpenCV::GPU and OpenCV::CUDA with a CUDA GPU?
What is the difference between OpenCV::GPU and OpenCV::CUDA?
1. CUDA can only be used if you have NVIDIA cards, and the power of general-purpose GPU hardware is only exploited if you do parallel processing. For example, if you are working with images and every pixel gets an independent operation, then GPU programming helps save computation time. If, in your application, the second pixel's input depends on the first pixel's result, then it is better to run your application on the CPU itself. Also, data transfers from CPU to GPU and from GPU to CPU affect performance, so you need to take care of this while you code.
2 & 3. OpenCV 2 uses the cv::gpu namespace, whereas OpenCV 3 uses cv::cuda. Which one applies depends on which OpenCV version you use.
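To illustrate the OpenCV 3 side of that, here is a sketch of the cv::cuda API (it assumes OpenCV was built with CUDA support and that an "input.png" exists; the OpenCV 2 equivalent lives in cv::gpu):

    #include <opencv2/core.hpp>
    #include <opencv2/core/cuda.hpp>
    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/imgproc.hpp>
    #include <opencv2/cudaarithm.hpp>

    int main()
    {
        cv::Mat host = cv::imread("input.png", cv::IMREAD_GRAYSCALE);

        cv::cuda::GpuMat d_src, d_dst;
        d_src.upload(host);                                   // host -> device copy
        cv::cuda::threshold(d_src, d_dst, 128.0, 255.0,
                            cv::THRESH_BINARY);               // runs on the GPU

        cv::Mat result;
        d_dst.download(result);                               // device -> host copy
        return 0;
    }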

Run OpenCL without compatible hardware?

I have two PCs:
a new high-end desktop PC, OpenCL compatible CPU and GPU, 32GB RAM
a very old laptop, Intel Celeron CPU, 512MB RAM, Ati M200 GPU
I am writing OpenCL/C++ software on my desktop PC, but when I travel somewhere I continue the work on my old laptop. Programming C++ on this laptop is fine, but I can't try the OpenCL parts of my code, so at the moment I am writing OpenCL code without knowing whether it is correct or not.
Is there a way to emulate or virtualize an OpenCL-compatible CPU/GPU? I don't need high performance, I just want to try my code; it doesn't matter if it is very slow (even slower than running it single-threaded on my Celeron CPU).
I guess the answer is no.
(BTW, my plan is to have an option in my program so that it can run with or without OpenCL. This is also needed to measure performance and to compare OpenCL on CPU/GPU against single-threaded CPU code without OpenCL.)
Almost an answer, but not completely what I am looking for: http://www.acooke.org/cute/Developing0.html
For all existing OpenCL implementations, you need some form of SSE.
A website gathering all this info is here.
The lowest requirements are provided by the AMD OpenCL drivers, which require SSE3. As the list shows, that goes all the way back to late Pentium 4's.
In order to be sure about your CPU's capabilities, you'll need to use something like CPU-Z which can show the capabilities of your processor.
All that aside, I searched for laptops with your GPU, and ended up with processors like the Intel Celeron M 420, which according to Intel doesn't even have 64-bit support (which would imply SSE2).
I currently know of no other OpenCL implementations that are worth anything, so the answer would be no.
On the other hand, some websites claim that processor has SSE3 support, so that would mean AMD's OpenCL SDK is your option of choice.
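Once you have installed one of those SDKs, a quick way to see whether its runtime actually accepts your CPU is to enumerate the platforms and devices it exposes. A rough sketch (error handling mostly omitted):

    #include <CL/cl.h>
    #include <cstdio>

    int main()
    {
        cl_platform_id platforms[8];
        cl_uint numPlatforms = 0;
        clGetPlatformIDs(8, platforms, &numPlatforms);

        for (cl_uint p = 0; p < numPlatforms; ++p) {
            char name[256] = {0};
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
            printf("Platform: %s\n", name);

            cl_device_id devices[8];
            cl_uint numDevices = 0;
            if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8,
                               devices, &numDevices) != CL_SUCCESS)
                continue;   // e.g. no device of this type on this platform

            for (cl_uint d = 0; d < numDevices; ++d) {
                char devName[256] = {0};
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(devName), devName, NULL);
                printf("  Device: %s\n", devName);
            }
        }
        return 0;
    }

If the AMD platform shows up but lists no CPU device, the SSE3 requirement is most likely the reason.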

OpenCL Verification of Parallel Execution

What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it does. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL 1.2 API, along with a simple OpenCL C kernel which simply squares the input integers.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that the work-groups will fit on the compute units and OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully this is enough to force the scheduler to execute the maximum number of kernels and work items in parallel, as efficiently as possible, making use of the available cores / processors.
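For concreteness, a squaring kernel of the kind described would look something like the sketch below; the extra groupOf buffer is my own addition (not part of the original program) to record which work-group produced each result, so the distribution of work items can at least be inspected on the host afterwards:

    __kernel void square(__global const int* in,
                         __global int* out,
                         __global int* groupOf)   // hypothetical extra output buffer
    {
        size_t gid = get_global_id(0);
        out[gid]     = in[gid] * in[gid];
        groupOf[gid] = (int)get_group_id(0);      // which work-group handled this item
    }

Note that get_group_id reports the logical work-group, not the physical compute unit it ran on.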
For a CPU, you can check cpuid, or sched_getcpu, or GetProcessorNumber in order to check which core / processor the current thread is currently executing on.
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built-in function... or perhaps do the vendors' compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetProcessorNumber for GPUs, for core-usage monitoring and the like? Perhaps something vendor- or architecture-specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, neither of which is useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
Hardware Details:
GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M Series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (the way there are manuals for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit; it will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out a program's buffer copies and visualisations and leave only the kernel executions intact. Then put it in a tight loop and watch the heat rising. If it heats up like FurMark, then it is using the cores. If it does not heat up, you can also disable any serial operations in the kernels (the gid==0 branches), then try again. For example, a simple n-body simulator pushes a well-cooled HD 7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare that against a known benchmark's temperature limits.
Something similar exists for the CPU: using float4 heats it more than plain floats, which shows that even the instruction type matters for keeping all ALUs busy (let alone all threads).
If the GPU has a really good cooler, you can watch its Vdroop instead: more load means more voltage drop; more active cores mean more drop, and more load per core also means more drop.
Whatever you do, it is up to the compiler and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides the hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because such tools show 100% usage even when you use 1% of the card's true potential.
Simply look at the temperature difference of the computer case between the starting state and the equilibrium state. If the delta-T is something like 50 with OpenCL and 5 without, OpenCL is parallelising something, though you cannot know by how much.

OpenGL vs. OpenCL, which to choose and why?

What features make OpenCL unique to choose over OpenGL with GLSL for calculations? Despite the graphics-related terminology and impractical datatypes, is there any real caveat to OpenGL?
For example, parallel function evaluation can be done by rendering to a texture using other textures as input. Reduction operations can be done by iteratively rendering to smaller and smaller textures. On the other hand, random write access is not possible in any efficient manner (the only way to do it is by rendering triangles with texture-driven vertex data). Is this possible with OpenCL? What else is possible or not possible with OpenGL?
OpenCL is created specifically for computing. When you do scientific computing using OpenGL you always have to think about how to map your computing problem to the graphics context (i.e. talk in terms of textures and geometric primitives like triangles etc.) in order to get your computation going.
In OpenCL you just formulate your computation as a kernel operating on a memory buffer and you are good to go. This is actually a BIG win (said from the perspective of having thought through and implemented both variants).
The memory access patterns are, though, the same (your calculation still happens on a GPU, but GPUs are getting more and more flexible these days).
But what more would you expect than being able to use dozens of parallel "CPUs" without breaking your head over how to translate, e.g. (silly example), a Fourier transform into triangles and quads?
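"A kernel on a memory buffer" looks roughly like this (a minimal sketch, error checking omitted): it squares each element of an array without any notion of textures or triangles.

    #include <CL/cl.h>
    #include <cstdio>

    static const char* kSource =
        "__kernel void square(__global float* buf) {\n"
        "    size_t i = get_global_id(0);\n"
        "    buf[i] = buf[i] * buf[i];\n"
        "}\n";

    int main()
    {
        float data[1024];
        for (int i = 0; i < 1024; ++i) data[i] = (float)i;

        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        // The "memory buffer": just bytes, initialised from host memory.
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kern = clCreateKernel(prog, "square", NULL);
        clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);

        size_t globalSize = 1024;
        clEnqueueNDRangeKernel(queue, kern, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

        printf("data[3] = %f\n", data[3]);   // expect 9.0

        clReleaseKernel(kern); clReleaseProgram(prog); clReleaseMemObject(buf);
        clReleaseCommandQueue(queue); clReleaseContext(ctx);
        return 0;
    }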
Something that hasn't been mentioned in any answers so far has been speed of execution. If your algorithm can be expressed in OpenGL graphics (e.g. no scattered writes, no local memory, no workgroups, etc.) it will very often run faster than an OpenCL counterpart. My specific experience of this has been doing image filter (gather) kernels across AMD, nVidia, IMG and Qualcomm GPUs. The OpenGL implementations invariably run faster even after hardcore OpenCL kernel optimization. (aside: I suspect this is due to years of hardware and drivers being specifically tuned to graphics orientated workloads.)
My advice would be that if your compute program feels like it maps nicely to the graphics domain then use OpenGL. If not, OpenCL is more general and simpler to express compute problems.
Another point to mention (or to ask) is whether you are writing as a hobbyist (i.e. for yourself) or commercially (i.e. for distribution to others). While OpenGL is supported pretty much everywhere, OpenCL is totally lacking support on mobile devices and, imho, is highly unlikely to appear on Android or iOS in the next few years. If wide cross platform compatibility from a single code base is a goal then OpenGL may be forced upon you.
What features make OpenCL unique to choose over OpenGL with GLSL for calculations? Despite the graphics-related terminology and impractical datatypes, is there any real caveat to OpenGL?
Yes: it's a graphics API. Therefore, everything you do in it has to be formulated along those terms. You have to package your data as some form of "rendering". You have to figure out how to deal with your data in terms of attributes, uniform buffers, and textures.
With OpenGL 4.3 and OpenGL ES 3.1 compute shaders, things become a bit more muddled. A compute shader is able to access memory via SSBOs/Image Load/Store in similar ways to OpenCL compute operations (though OpenCL offers actual pointers, while GLSL does not). Their interop with OpenGL is also much faster than OpenCL/GL interop.
Even so, compute shaders do not change one fact: OpenCL compute operations operate at a very different precision than OpenGL's compute shaders. GLSL's floating-point precision requirements are not very strict, and OpenGL ES's are even less strict. So if floating-point accuracy is important to your calculations, OpenGL will not be the most effective way of computing what you need to compute.
Also, OpenGL compute shaders require 4.x-capable hardware, while OpenCL can run on far less capable hardware.
Furthermore, if you're doing compute by co-opting the rendering pipeline, OpenGL drivers will still assume that you're doing rendering. So it's going to make optimization decisions based on that assumption. It will optimize the assignment of shader resources assuming you're drawing a picture.
For example, if you're rendering to a floating-point framebuffer, the driver might just decide to give you an R11_G11_B10 framebuffer, because it detects that you aren't doing anything with the alpha and your algorithm could tolerate the lower precision. If you use image load/store instead of a framebuffer however, you're much less likely to get this effect.
OpenCL is not a graphics API; it's a computation API.
Also, OpenCL just gives you access to more stuff. It gives you access to memory levels that are implicit with regard to GL. Certain memory can be shared between threads, but separate shader instances in GL are unable to directly affect one-another (outside of Image Load/Store, but OpenCL runs on hardware that doesn't have access to that).
OpenGL hides what the hardware is doing behind an abstraction. OpenCL exposes you to almost exactly what's going on.
You can use OpenGL to do arbitrary computations. But you don't want to; not while there's a perfectly viable alternative. Compute in OpenGL lives to service the graphics pipeline.
The only reason to pick OpenGL for any kind of non-rendering compute operation is to support hardware that can't run OpenCL. At the present time, this includes a lot of mobile hardware.
One notable feature would be scattered writes, another would be the absence of "Windows 7 smartness". Windows 7 will, as you probably know, kill the display driver if OpenGL does not flush for 2 seconds or so (don't nail me down on the exact time, but I think it's 2 secs). This may be annoying if you have a lengthy operation.
Also, OpenCL obviously works with a much greater variety of hardware than just the graphics card, and it does not have a rigid graphics-oriented pipeline with "artificial constraints". It is easier (trivial) to run several concurrent command streams too.
Although currently OpenGL would be the better choice for graphics, this is not permanent.
It could be practical for OpenGL to eventually merge as an extension of OpenCL. The two platforms are about 80% the same, but have different syntax quirks, different nomenclature for roughly the same components of the hardware. That means two languages to learn, two APIs to figure out. Graphics driver developers would prefer a merge because they no longer would have to develop for two separate platforms. That leaves more time and resources for driver debugging. ;)
Another thing to consider is that the origins of OpenGL and OpenCL are different: OpenGL began and gained momentum during the early fixed-pipeline-over-a-network days and was slowly extended, and parts of it deprecated, as the technology evolved. OpenCL, in some ways, is an evolution of OpenGL in the sense that OpenGL started being used for numerical processing as the (unplanned) flexibility of GPUs allowed it. "Graphics vs. computing" is really more of a semantic argument. In both cases you're always trying to map your math operations to hardware with the highest performance possible. There are parts of GPU hardware which vanilla CL won't use, but that won't keep a separate extension from doing so.
So how could OpenGL work under CL? Speculatively, triangle rasterizers could be enqueued as a special CL task. Special GLSL functions could be implemented in vanilla OpenCL, then overridden with hardware-accelerated instructions by the driver during kernel compilation. Writing a shader in OpenCL, provided the library extensions were supplied, doesn't sound like a painful experience at all.
To say that one has more features than the other doesn't make much sense, as they're both gaining roughly the same 80% of features, just under different nomenclature. To claim that OpenCL is not good for graphics because it is designed for computing doesn't make sense, because graphics processing is computing.
Another major reason is that OpenGL/GLSL is supported only on graphics cards. Although multi-core usage started with graphics hardware, many hardware vendors are working on multi-core hardware platforms targeted at computation. For example, see Intel's Knights Corner.
Developing code for computation using OpenGL/GLSL will prevent you from using any hardware that is not a graphics card.
Well, as of OpenGL 4.5, these are the features OpenCL 2.0 has that OpenGL 4.5 doesn't (as far as I could tell; this does not cover the features that OpenGL has that OpenCL doesn't):
Events
Better Atomics
Blocks
Workgroup functions (see the sketch after this list):
work_group_all and work_group_any
work_group_broadcast
work_group_reduce
work_group_inclusive/exclusive_scan
Enqueue Kernel from Kernel
Pointers (though if you are executing on the GPU this probably doesn't matter)
A few math functions that OpenGL doesn't have (though you could construct them yourself in OpenGL)
Shared Virtual Memory
(More) Compiler Options for Kernels
Easy to select a particular GPU (or otherwise)
Can run on the CPU when no GPU is available
More support for niche hardware platforms (e.g. FPGAs)
On some (all?) platforms you do not need a window (and its context binding) to do calculations.
OpenCL allows just a bit more control over precision of calculations (including some through those compiler options).
A lot of the above are mostly for better CPU - GPU interaction: Events, Shared Virtual Memory, Pointers (although these could potentially benefit other stuff too).
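To make the workgroup functions concrete, here is a sketch of an OpenCL C 2.0 kernel (build with -cl-std=CL2.0; the kernel and buffer names are my own) that sums all values handled by a work-group with work_group_reduce_add:

    __kernel void block_sums(__global const float* in, __global float* perGroupSum)
    {
        float mine  = in[get_global_id(0)];
        float total = work_group_reduce_add(mine);   // every work-item receives the group's sum
        if (get_local_id(0) == 0)
            perGroupSum[get_group_id(0)] = total;    // one result per work-group
    }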
OpenGL has gained the ability to sort things into different areas of Client and Server memory since a lot of the other posts here have been made.
OpenGL now has better memory barrier and atomics support and allows you to allocate things to different memory regions within the GPU (to about the same degree OpenCL can). For example, you can now share data within the local compute group in OpenGL (backed by something like the LDS (local data share) on AMD GPUs), though this particular feature only works with OpenGL compute shaders at this time.
OpenGL has stronger, better-performing implementations on some platforms (such as the open-source Linux drivers).
OpenGL has access to more fixed-function hardware (as other answers have said). While it is true that fixed-function hardware can sometimes be avoided (e.g. Crytek uses a "software" implementation of a depth buffer), it can manage memory just fine (and usually a lot better than someone who isn't working for a GPU hardware company could) and is just vastly superior in most cases. I must admit OpenCL has pretty good fixed-function texture support, which is one of the major OpenGL fixed-function areas.
I would argue that Intel's Knights Corner is an x86 GPU that controls itself.
I would also argue that OpenCL 2.0, with its texture functions (which are actually available in lesser versions of OpenCL), can be used to much the same performance degree that user2746401 suggested.
In addition to the already existing answers, OpenCL/CUDA not only fits the computational domain better, it also doesn't abstract away the underlying hardware too much. This way you can profit from things like shared memory or coalesced memory access more directly, which would otherwise be buried in the actual implementation of the shader (which itself is nothing more than a special OpenCL/CUDA kernel, if you want).
To profit from such things, though, you also need to be a bit more aware of the specific hardware your kernel will run on; with a shader you cannot really take those things into account explicitly (if it is even completely possible).
Once you do something more complex than simple level 1 BLAS routines, you will surely appreciate the flexibility and genericity of OpenCL/CUDA.
The "feature" that OpenCL is designed for general-purpose computation, while OpenGL is for graphics. You can do anything in GL (it is Turing-complete) but then you are driving in a nail using the handle of the screwdriver as a hammer.
Also, OpenCL can run not just on GPUs, but also on CPUs and various dedicated accelerators.
OpenCL (in its 2.0 version) describes a heterogeneous computing environment where every component of the system can both produce and consume tasks generated by other system components. The notions of CPU, GPU, etc. are no longer needed - you just have a Host and Device(s).
OpenGL, in contrast, has a strict division between the CPU, which is the task producer, and the GPU, which is the task consumer. That's not bad, as less flexibility can mean greater performance. OpenGL is just a narrower-scope instrument.
One thought is to write your program in both and test them with respect to your priorities.
For example: if you're processing a pipeline of images, one of the implementations (OpenGL or OpenCL) may simply turn out to be faster than the other.
Good luck.

Cuda 4.0 vs 3.2

Is CUDA 4.0 faster than 3.2?
I am not interested in the additions of CUDA 4.0 but rather in knowing if memory allocation and transfer will be faster if I used CUDA 4.0.
Thanks
Memory allocation and transfer depend more (if not exclusively) on the hardware capabilities (more efficient pipelines, size of cache), not the version of CUDA.
Even while on CUDA 3.2, you can install the CUDA 4.0 drivers (270.x) -- drivers are backward compatible. That way you can test the driver change separately from recompiling your application. It is true that there are driver-level optimizations that affect run-time performance.
While generally that has worked fine on Linux, I have noticed some hiccups on MacOSX.
Yes, I have a fairly substantial application which ran ~10% faster once I switched from 3.2 to 4.0. This is without any code changes to take advantage of new features.
I also have a GTX480 if that matters any.
Note that the performance gains may be due to the fact that I'm using a newer version of the dev drivers (installed automatically when you upgrade). I imagine NVIDIA may well be tweaking for CUDA performance the way they do for blockbuster games like Crysis.
Performance of memory allocation mostly depends on host platform (because the driver models differ) and driver implementation. For large amounts of device memory, allocation performance is unlikely to vary from one CUDA version to the next; for smaller amounts (say less than 128K), policy changes in the driver suballocator may affect performance.
For pinned memory, CUDA 4.0 is a special case because it introduced some major policy changes on UVA-capable systems. First of all, on initialization the driver makes some huge virtual address reservations. Secondly, all pinned memory is portable, so it must be mapped for every GPU in the system.
Performance of PCI Express transfers is mostly an artifact of the platform, and usually there is not much a developer can do to control it. (For small CUDA memcpy's, driver overhead may vary from one CUDA version to another.) One issue is that on systems with multiple I/O hubs, nonlocal DMA accesses go across the HT/QPI link and so are much slower. If you're targeting such systems, use NUMA APIs to steer memory allocations (and threads) onto the same CPU that the GPU is plugged into.
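As an aside, the pinned (portable) allocations mentioned above look like this with the CUDA runtime API (a sketch; the buffer size is arbitrary):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        float* h_pinned = NULL;
        size_t bytes = 1 << 20;

        // Page-locked host memory; cudaHostAllocPortable makes it usable from every CUDA context.
        cudaError_t err = cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocPortable);
        if (err != cudaSuccess) {
            printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        float* d_buf = NULL;
        cudaMalloc((void**)&d_buf, bytes);
        cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);   // DMA from pinned memory

        cudaFree(d_buf);
        cudaFreeHost(h_pinned);
        return 0;
    }

On CUDA 4.0 with UVA, pinned allocations are portable by default; the explicit flag just makes the intent clear on older versions too.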
The answer is yes, because CUDA 4.0 reduces system memory usage and CPU memcpy() overhead.