No performance gain with OpenCV 3.2 on OpenCL (TAPI) - c++

Absolute TAPI beginner here. I recently ported my CV code to use UMat instead of Mat since my CPU was at its limit; the morphological operations in particular seemed to consume quite some computing power.
Now with UMat I cannot see any change in my framerate: it is exactly the same whether I use UMat or not, and Process Explorer reports no GPU usage whatsoever. I did a small test with a few calls of dilation and closing on a full HD image -- no effect.
Am I missing something here? I'm using the latest OpenCV 3.2 build for Windows and a GTX 980 with driver 378.49. cv::ocl::haveOpenCL() and cv::ocl::useOpenCL() both return true and cv::ocl::Context::getDefault().device( 0 ) also gives me the correct device, everything looks good as far as I can tell. Also, I'm using some custom CL code via cv::ocl::Kernel which is definitely invoked.
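For reference, a minimal version of the test I mean looks like this (the image size, kernel, and iteration count are arbitrary choices; cv::ocl::finish() is there so the asynchronous OpenCL work is actually included in the timing):

    #include <opencv2/opencv.hpp>
    #include <opencv2/core/ocl.hpp>
    #include <iostream>

    int main()
    {
        std::cout << "haveOpenCL: " << cv::ocl::haveOpenCL() << std::endl;
        std::cout << "useOpenCL:  " << cv::ocl::useOpenCL() << std::endl;
        std::cout << "device:     "
                  << cv::ocl::Context::getDefault().device(0).name().c_str() << std::endl;

        cv::UMat src(1080, 1920, CV_8UC1, cv::Scalar(0)), dst;
        cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(5, 5));

        double t0 = (double)cv::getTickCount();
        for (int i = 0; i < 100; ++i)
            cv::dilate(src, dst, kernel);
        cv::ocl::finish();  // wait for queued OpenCL work before stopping the clock
        double ms = 1000.0 * ((double)cv::getTickCount() - t0) / cv::getTickFrequency();
        std::cout << "100 dilations: " << ms << " ms" << std::endl;
        return 0;
    }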
I realize it is naive to think that just changing Mat to UMat will result in a huge performance gain (although every one of the very limited number of resources covering TAPI that I can find online suggests exactly that). Still, I was hoping to get some gain for starters and then further optimize step-by-step from there. However, the fact that I can't detect any GPU usage whatsoever highly irritates me.
Is there something I have to watch out for? Maybe my usage of TAPI prevents a streamlined execution of the OpenCL code, perhaps through accidental/hidden readbacks I'm not aware of? Do you see any way of profiling the code with respect to that matter?
Are there any how-to's, best practices or common pitfalls for using TAPI? Things like "don't use local UMat instances within functions", "use getUMat() instead of copyTo()", "avoid calls of function x since it will cause a cv::ocl::flush()", things of that sort?
Are there OpenCV operations that are not ported to OpenCL yet? Is there documentation accordingly? In the OpenCV source code I saw that, if built with the HAVE_OPENCL flag, the functions try to run CL code using the CV_OCL_RUN macro; however, a few conditions are checked beforehand, otherwise it falls back to the CPU. It does not seem like I have any possibility to figure out whether the GPU or the CPU was actually used, apart from stepping into each and every OpenCL function with the debugger -- am I right?
Any ideas/experiences apart from that? I'd appreciate any input that relates to this matter.

Related

Visual C++ | How to benchmark EVERY FUNCTION and log output? [duplicate of "What's a very easy C++ profiler (VC++)?" below]


What is the macro CV_OCL_RUN used for in OpenCV?

I was studying hog.cpp as implemented in OpenCV when I encountered the macro CV_OCL_RUN and was confused by it.
In hog.cpp, where detectMultiScale() is defined, you can find CV_OCL_RUN with a method called ocl_detectMultiScale() inside it. Comparing detectMultiScale() and ocl_detectMultiScale(), not only their names but also their implementations are quite similar.
Here are my questions:
What is the macro CV_OCL_RUN used for? Is it for testing or some other purpose?
Since detectMultiScale() and ocl_detectMultiScale() are so similar in functionality, why is the latter embedded in the former? In what ways are they called?
Thanks in advance!
CV_OCL_RUN is for OpenCL code.
If your computer is not able to use OpenCL capabilities (no GPU or no OpenCL driver), the regular (CPU) code is run. You can also switch between the regular code and the OpenCL version yourself: if setUseOptimized() or setUseOpenCL() is set to false, the regular code will be used.
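For comparing the two paths, the switch can be flipped at runtime; a minimal sketch:

    #include <opencv2/core/ocl.hpp>

    void timeBothPaths()
    {
        cv::ocl::setUseOpenCL(false);  // force the regular (CPU) code paths
        // ... run and time the pipeline here ...
        cv::ocl::setUseOpenCL(true);   // re-enable OpenCL dispatch
    }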
You can find the kernel code that will be run on the GPU device in the opencl directory of the OpenCV source tree.
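The dispatch pattern itself is simple. The real CV_OCL_RUN lives in a private OpenCV header and also records implementation statistics, but a hand-rolled equivalent (with made-up function names) looks roughly like this:

    #include <opencv2/opencv.hpp>
    #include <opencv2/core/ocl.hpp>

    static bool ocl_myFilter(cv::InputArray src, cv::OutputArray dst)
    {
        cv::UMat u = src.getUMat();        // data stays on the device
        cv::blur(u, dst, cv::Size(3, 3));  // stand-in for a real OpenCL kernel
        return true;                       // true = OpenCL path succeeded
    }

    void myFilter(cv::InputArray src, cv::OutputArray dst)
    {
        // Essentially what CV_OCL_RUN(condition, func) expands to:
        if (cv::ocl::useOpenCL() && src.isUMat() && ocl_myFilter(src, dst))
            return;                          // OpenCL version ran; we're done
        cv::blur(src, dst, cv::Size(3, 3));  // regular CPU fallback
    }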
PS: OpenCL is not only for GPU.

Matlab to C or C++

I am working on an image processing project using Matlab, and we are supposed to run our program (intended to be an application) on a cell phone. We were then asked to convert our code into C or C++ so we get a feel for how long it would take to execute, and then choose a platform. So far we haven't figured out how to do this conversion. Any ideas on how to convert Matlab to C or C++?
The first thing you need to realise is that porting code from one language to another (especially languages as different as Matlab and C++) is generally non-trivial and time-consuming. You need to know both languages well, and you need to have similar facilities available in both. In the case of Matlab and C++, Matlab gives you a lot of stuff that you just won't have available in C++ without using libraries. So the first thing to do is identify which libraries you're going to need to use in C++. (You can write some of the stuff yourself, but you'll be there a long time if you write all of it yourself.)
If you're doing image processing, I highly recommend looking into something like ITK at http://www.itk.org -- I've written my image processing software twice in C++, once without ITK (coding everything myself) and once with, and the version that used ITK was finished faster, performed better and was ten times more fun to work on. FWIW.
Matlab can generate C code for you.
See:
http://www.mathworks.com/products/featured/embeddedmatlab/
The generated code does, however, depend on Matlab libraries, so you probably can't use it on a cell phone. But it might save you some time anyway.
I also used the MATLAB Coder to convert some functions consisting of a few hundred lines of MATLAB into C. This included using MATLAB's eigenvalue solver and matrix inversion functions.
Although Coder was able to produce C code (which theoretically was identical), it was very convoluted, bloated, impossible to decipher, and appeared to be extremely inefficient. It literally created about 10x as many lines of code as it should have needed. I ended up converting it all by hand so that I would actually be able to comprehend the C code later and make further changes/updates. This task, however, can be very tedious/dangerous, as array indexing in Matlab is 1-based and in C it's 0-based. You're likely to add bugs to the code, as I experienced. You'll also have to convert any vector/matrix arithmetic into loops that handle scalars (or use some kind of C matrix algebra package).
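A tiny illustration of that indexing pitfall (the function and arrays here are hypothetical; the MATLAB original is shown as a comment):

    // MATLAB original (1-based): for i = 1:n, y(i) = x(i) + x(1); end
    void addFirst(const double* x, double* y, int n)
    {
        for (int i = 0; i < n; ++i)  // 0-based: every index shifts down by one
            y[i] = x[i] + x[0];      // x(1) becomes x[0]; missing this shift is a classic porting bug
    }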
The MathWorks provides a product called MATLAB Coder that claims to generate "readable and portable C and C++ code from MATLAB® code". I haven't tried it myself, so I can't comment on how well it accomplishes these goals.
With regard to the Image Processing Toolbox, this list (presumably for R2016b) shows which functions have been enabled for code generation and any limitations they may have.
Matlab has a tool called "Matlab Coder" which can convert your Matlab file to C code or a mex file. My code is relatively simple, so it works fine; the result is about 10 times faster. This saved me the time of coding a few hundred lines. Hope it's helpful for you too.
Quick Start Guide for MATLAB Coder Confirmation
The links describe the process of converting your code in 3 major steps:
First, you need to make a few simplifications in your present code so that it is simple enough for the Coder to translate.
Second, you will use the tool to generate a mex file and test if everything is actually working.
Finally, you change some settings and generate the C code. In my case, the C code has about 700 lines, including all the original Matlab code (about 150 lines) as comments. I think it's quite readable and could be improved upon. However, I already get a 10-times speedup from the mex file anyway, so this is definitely a good thing.
We can't be sure that this will work in every case, but it's definitely worth trying.
I remember there is a tool to export m-files as C(++) files, but I could never get it running. You need to add some obscure MATLAB headers to the C/C++ code, ... And I think it is also not recommended.
If you have running MATLAB code, it shouldn't take too much effort to do the conversion "by hand". I have been working on several projects where MATLAB was used, and using tools to convert the code to C/C++ was never considered; it was always done "by hand".
I believe I was the only one who ever investigated using a tool.
Well, there is no straight conversion from Matlab to C/C++. You will need to understand both languages and the differences between Matlab and C/C++, and then start coding it in C/C++. Code a little, test a little, until it works.

Profiling line-by-line in C++

I have a C++ program I am trying to optimize.
Since I want it to run fast, I am not using a lot of function calls. Most profiling tools I have seen give you profiling info at function-call resolution; however, I would like it at line-by-line resolution. Is there some option like this?
I am using Visual Studio 2010 on Windows.
Thanks.
Intel Parallel Amplifier should be capable of what you want.
If you're running on an AMD processor, CodeAnalyst is free and can do that (at least for time-based profiling); you can actually "zoom" in and out, seeing what is taking the most CPU time, from processes to functions down to single assembly instructions.
However, keep in mind that to get meaningful results at that resolution with time-based profiling, you should run the critical part of the code several times; otherwise the statistics you get don't mean much.
By the way, in my opinion you should forget about the "fewer function calls => faster" idea. If the cost of a function call is bigger than its "payload", the compiler should be able to figure out by itself whether it's worthwhile to inline the call, and in some cases even inlining too much can slow down the code.
AQTime is a commercial profiler for Windows, and I have found it to work pretty well for both function and line timings. One thing I like about it is that you do not have to fiddle with compiler options or Visual Studio settings -- i.e. you do not need any additional compiler options to enable profiling: all you need is the pdb (symbol) file and the executable. (And yes, you can create a pdb file for your release compile.)
IMHO, this method is best, for these reasons, and here's an example of a 43x speedup. It's not a well-known technique, except to a small number of people, for one example, and another, and another. You may be surprised that it's very low-tech and manual, but you can't beat the results.
Oh, and by the way, for Visual Studio, LTProf may well be the next best thing. It gives you line-level percents, derived from stack samples taken at random wall-clock times. Don't get sucked in by a lot of fancy UI options or promises of accuracy of timing. Those things don't matter. What matters is that it pinpoints the spots worth optimizing.

What's a very easy C++ profiler (VC++)?

I've used a few profilers in the past and never found them particularly easy. Maybe I picked bad ones, maybe I didn't really know what I was expecting!
But I'd like to know if there are any 'standard' profilers which simply drop in and work? I don't believe I need massively fine-detailed reports, just to pick up major black-spots. Ease of use is more important to me at this point.
It's VC++ 2008 we're using (I run standard edition personally). I don't suppose there are any tools in the IDE for this, I can't see any from looking at the main menus?
I suggest a very simple method (which I learned from reading Mike Dunlavey's posts on SO):
Just pause the program.
Do it several times to get a reasonable sample. If a particular function is taking half of your program's execution time, the odds are that you will catch it in the act very quickly.
If you improve that function's performance by 50%, then you've just improved overall execution time by 25%. And if you discover that it's not even needed at all (I have found several such cases in the short time I've been using this method), you've just cut the execution time in half.
I must confess that at first I was quite skeptical of the efficacy of this approach, but after trying it for a couple of weeks, I'm hooked.
VS built in:
If you have the Team edition, you can use the Visual Studio profiler.
Other options:
Otherwise check this thread.
Creating your own easily:
I personally use an internally built one based on the Win32 API QueryPerformanceCounter.
You can make something nice and easy to use within a hundred lines of code or less.
The process is simple: at the top of each function you want to profile, place a macro called PROFILE_FUNC(), which adds to internally managed stats. Then have another macro called PROFILE_DUMP() that dumps the output to a text document.
PROFILE_FUNC() creates an object that uses RAII to log the amount of time until the object is destroyed; both the constructor and the destructor of this RAII object call QueryPerformanceCounter. You can also leave these lines in your code permanently and control the behavior via a #define PROFILING_ON.
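A minimal sketch of those macros, kept C++03-friendly for VC++ 2008 and assuming Windows (the names, the stats container, and the output format are illustrative, and nothing here is thread-safe):

    #include <windows.h>
    #include <cstdio>
    #include <map>
    #include <string>

    struct ProfileStats
    {
        double totalMs;
        long   calls;
        ProfileStats() : totalMs(0.0), calls(0) {}
    };

    inline std::map<std::string, ProfileStats>& profileTable()
    {
        static std::map<std::string, ProfileStats> table;
        return table;
    }

    class ScopedProfiler
    {
        const char*   name_;
        LARGE_INTEGER start_;
    public:
        explicit ScopedProfiler(const char* name) : name_(name)
        { QueryPerformanceCounter(&start_); }
        ~ScopedProfiler()  // RAII: logs the elapsed time on scope exit
        {
            LARGE_INTEGER end, freq;
            QueryPerformanceCounter(&end);
            QueryPerformanceFrequency(&freq);
            ProfileStats& s = profileTable()[name_];
            s.totalMs += 1000.0 * (end.QuadPart - start_.QuadPart) / freq.QuadPart;
            ++s.calls;
        }
    };

    inline void profileDump(const char* path)
    {
        FILE* f = std::fopen(path, "w");
        if (!f) return;
        std::map<std::string, ProfileStats>::const_iterator it = profileTable().begin();
        for (; it != profileTable().end(); ++it)
            std::fprintf(f, "%-40s %10.3f ms %8ld calls\n",
                         it->first.c_str(), it->second.totalMs, it->second.calls);
        std::fclose(f);
    }

    #ifdef PROFILING_ON
    #  define PROFILE_FUNC() ScopedProfiler _prof_(__FUNCTION__)
    #  define PROFILE_DUMP() profileDump("profile.txt")
    #else
    #  define PROFILE_FUNC()
    #  define PROFILE_DUMP()
    #endif

Drop a PROFILE_FUNC() at the top of each function of interest and call PROFILE_DUMP() once at shutdown.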
I always used AMD CodeAnalyst; I find it quite easy to use, and it gives interesting results. I always used the time-based profile, and found that it cooperates well with my apps' debug information, letting me find where the time is spent at the procedure, C++ instruction, and single assembly instruction level.
I used LTProf in the past for a quick rundown of my C++ app. It is pretty easy to use and runs against a compiled program; it does not need any source-code hooks or tweaks. I believe there is a trial version available.
A very simple (and free) way to profile is to install the Windows debuggers (cdb/windbg), set a breakpoint on the place of interest, and issue the wt command ("Trace and Watch Data"). Check out MSDN for more info.
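A sketch of that workflow (the module and function names below are placeholders):

    0:000> bp myapp!HotFunction
    0:000> g
    0:000> wt

After the breakpoint hits, wt traces execution until the function returns and prints a call tree with instruction counts, which makes the expensive callees stand out.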
Another super simple and useful profiling workflow, which works in any programming language, is to comment out blocks of code. After commenting out all of them, uncomment some and run your program to see the performance. If your program starts to run very slowly once some code has been uncommented, then you'll probably want to check the performance there.
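For C/C++ specifically, the preprocessor makes this kind of bisection easier than /* */ comments, which do not nest (expensiveStage here is a hypothetical stage under suspicion):

    #if 0   // flip to 1 to re-enable this stage while narrowing down the slow spot
        expensiveStage(data);
    #endif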