I'm looking at optimising some code with NVidia's OpenVX; from previous experience with the CUDA API, I know that GPU memory allocation is always a significant overhead.
So, I have a series of cv::Mat from video frames that I want to copy into a vx_image; the naive code is of course:
vxImage = nvx_cv::createVXImageFromCVMat(context, cvMat);
The optimisation would be to allocate a single image, then just copy the bits on top. Looking at the header files (documentation is rather scant) I find:
nvx_cv::copyCVMatToVXMatrix(vxImage, cvMat);
However, the name is VXMatrix, so the compiler complains about a mismatch between the vx_matrix and vx_image types, of course. As far as I can tell, there is no copyCVMatToVXImage API; am I missing something, or is there another way to do this?
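For reference, what I'm after is something along these lines, sketched with the standard OpenVX 1.1 API rather than the nvx_cv helpers (untested; assumes an 8-bit single-channel cv::Mat, so other formats would need a matching VX_DF_IMAGE_* constant and stride handling):

// allocate the vx_image once, up front
vx_image vxImage = vxCreateImage(context, cvMat.cols, cvMat.rows, VX_DF_IMAGE_U8);

// describe the cv::Mat layout for OpenVX
vx_rectangle_t rect = { 0, 0, (vx_uint32)cvMat.cols, (vx_uint32)cvMat.rows };
vx_imagepatch_addressing_t addr = VX_IMAGEPATCH_ADDR_INIT;
addr.dim_x    = (vx_uint32)cvMat.cols;
addr.dim_y    = (vx_uint32)cvMat.rows;
addr.stride_x = (vx_int32)cvMat.elemSize();
addr.stride_y = (vx_int32)cvMat.step;

// per frame: overwrite the pixels of the already-allocated image
vxCopyImagePatch(vxImage, &rect, 0, &addr, cvMat.data, VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);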
Related
I have an OpenCV CV_16UC3 matrix in which only the lower 8 bits per channel are occupied. I want to create a CV_8UC3 matrix from it. Currently I use this method:
cv::Mat mat8uc3_rgb(imgHeight, imgWidth, CV_8UC3); // cv::Mat takes (rows, cols)
mat16uc3_rgb.convertTo(mat8uc3_rgb, CV_8UC3);
This gives the desired result, but I wonder whether it can be made faster somehow.
Edit:
The entire processing chain consists of only 4 sub-steps (per-frame computation time measured with QueryPerformanceCounter on a video scene):
wrap the raw byte buffer in an OpenCV Mat:
cv::Mat mat16uc1_bayer(imgHeight, RawImageWidth, CV_16UC1, (uint8_t*)payload);
demosaicing:
-> cv::cvtColor(mat16uc1_bayer, mat16uc3_rgb, cv::COLOR_BayerGR2BGR);
needs 0.008808 s
pixel shift (only 12 of the 16 bits are occupied, but we only need 8 of them):
-> uses OpenCV's parallel pixel access via mat16uc3_rgb.forEach<>()
needs 0.004927 s
conversion from CV_16UC3 to CV_8UC3
mat16uc3_rgb.convertTo(mat8uc3_rgb, CV_8UC3);
needs 0.006913 s
I don't think I can avoid wrapping the raw buffer in a cv::Mat or the demosaicing step. The pixel shift probably can't be accelerated any further (the parallelized forEach() is already used there). I had hoped that the conversion from CV_16UC3 to CV_8UC3 could be done by just updating the matrix header info or similar, because the matrix data is already correct and doesn't have to be scaled any more.
I think you can safely assume that cv::Mat::convertTo is the fastest possible implementation of that operation.
Since you are converting to a different element type (not just relabeling the same bytes), it will likely not be a zero-cost operation: the data has to be rewritten into a new buffer.
If you are designing a very high-performance system, you should do an in-depth analysis of your bottlenecks and redesign your system to minimize them. Ask yourself: is this conversion really required at this point? Can I solve it by writing a custom function that integrates multiple operations in one? Can I use CPU parallelism extensions, multithreading or GPU acceleration? Etc.
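For instance, in this case the pixel shift and the 16-to-8-bit conversion could be folded into a single pass, so one full-image read/write disappears. A sketch (untested, assuming 12-bit data in the CV_16UC3 mat as described in the question):

cv::Mat mat8uc3_rgb(mat16uc3_rgb.rows, mat16uc3_rgb.cols, CV_8UC3);
mat16uc3_rgb.forEach<cv::Vec3w>(
    [&mat8uc3_rgb](const cv::Vec3w &src, const int *pos) {
        // keep the top 8 of the 12 occupied bits: shift and narrow in one step
        cv::Vec3b &dst = mat8uc3_rgb.at<cv::Vec3b>(pos[0], pos[1]);
        dst[0] = (uchar)(src[0] >> 4);
        dst[1] = (uchar)(src[1] >> 4);
        dst[2] = (uchar)(src[2] >> 4);
    });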
I have this simple code:
//OpenCV 3.3.1 project
#include <opencv2/opencv.hpp>
#include<iostream>
using namespace std;
using namespace cv;
Mat removeBlackEdge(Mat img) {
    if (img.channels() != 1)
        cvtColor(img, img, COLOR_BGR2GRAY);
    Mat bin, whiteEdge;
    threshold(img, bin, 0, 255, THRESH_BINARY_INV + THRESH_OTSU); // dark pixels -> white mask
    rectangle(bin, Rect(0, 0, bin.cols, bin.rows), Scalar(255));  // white frame around the mask
    whiteEdge = bin.clone();
    floodFill(bin, Point(0, 0), Scalar(0));                       // erase everything connected to the border
    dilate(whiteEdge - bin, whiteEdge, Mat());                    // keep only that border-connected part, grown a bit
    return whiteEdge + img;                                       // paint it white in the original
}
int main() { //13.0M
    Mat emptyImg = imread("test.png", 0); //14.7M
    emptyImg = removeBlackEdge(emptyImg); //33.0M
    waitKey();
    return 0;
}
And this is my test.png.
As Windows shows, its size is about 1.14 MB. Of course I know it will be larger once it is read into memory, but I still cannot accept the huge memory consumption.
If I step through with F10, I can see that project.exe takes up 13.0 MB. I have nothing to say about that.
When I run Mat emptyImg = imread("test.png", 0);, memory consumption is 14.7 MB. That is also normal.
But why, when I run emptyImg = removeBlackEdge(emptyImg);, does memory consumption climb to 33.0 MB? That means the function removeBlackEdge costs an extra 18.3 MB of memory. Can anybody tell me something? Am I missing something? I use the new emptyImg to replace the old emptyImg, so I thought no extra memory should be needed.
PS: If I want removeBlackEdge to do the same thing (delete the black edge in the image), how can I adjust it to use the minimum amount of memory?
Below is guesswork and hints; this is really a large comment. Please don't mark it as the answer even if it was helpful in finding/solving the problem. Rather, if you manage to find the cause, write your own answer instead. I'll probably remove this one later.
Most probably, image/bitmap operations like threshold(img, bin, ...) create copies of the original image. It's hard to say how many, but each surely creates at least one, as you can see from the second bin variable. Then come clone and whiteEdge, so that's a third copy. Operations like whiteEdge - bin probably create at least one copy (the result) as well. rectangle, dilate and floodFill probably work in-place, using one of the Mat variables passed to them as work area and output at the same time.
That means you have at least 4 copies of the image, probably a few more, but let's say it's 4 × 1.7 MB, so ~7 MB out of the ~19 MB increase you observed.
So we've got about 12 MB left to explain, or so it seems.
First of all, it's possible that the image operations allocate some extra buffers and keep them for later reuse. That costs memory, but it's "amortized": it won't cost you again if you do those operations again.
It's also possible that calling those image operations for the first time caused some dynamic libraries to be loaded. That's also a one-time cost that won't recur on later use.
The next thing to note is that process memory usage as reported by Windows through simple tools is... not accurate. As the process allocates and releases memory, the consumption reported by Windows usually only increases, and does not decrease upon release, at least not immediately. The process may keep some "air" or "vacuum". There are many causes for this, but the point is that this "air" is often reusable when the program tries to allocate memory again. A process with a base memory usage of 30 MB that periodically allocates and releases 20 MB may be displayed by Windows as always taking up 50 MB.
Having said that, I have to say I actually doubt your memory measurement methodology, or at least the conclusions drawn from the observations. After a single run of your removeBlackEdge you cannot say it consumed that amount of memory. All you observed is that the process's working memory pool grew by that much, and this by itself does not tell you much.
If you suspect that temporary copies of the image are taking up too much space, try getting rid of them. This may mean things as obvious as writing the code to use fewer Mat variables and temporaries, or reusing/deallocating bitmaps that are no longer needed rather than just waiting until the function ends, or less obvious things like selecting different algorithms or writing your own. You may also try to confirm that this is the case by running the program several times with input images of different sizes and plotting a chart of observed memory vs. input size. If it grows linearly (i.e. "memory consumption" is always ~10x the input size), then it's probably really the copies.
On the other hand, if you suspect a memory leak, run that removeBlackEdge several times. Hundreds, thousands or millions of times. Observe whether the "memory consumption" steadily grows over time or not. If it does, there's probably a leak. If instead it grows only once at the start and then stays at the same level, then probably nothing is wrong and it was only some one-time initialization or amortized caches/workspaces/etc.
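A minimal version of that long-run test could look like this (illustrative; watch the process in Task Manager, or log the numbers, while it runs):

// run the suspect function many times; steadily growing memory suggests a
// leak, while a one-time jump followed by a plateau suggests caches or
// one-time initialization
Mat img = imread("test.png", 0);
for (int i = 0; i < 100000; ++i) {
    Mat out = removeBlackEdge(img.clone()); // clone so the input stays intact
}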
I'd suggest you run those two tests now; further work and analysis depend on what you observe during such a long run. Also, I have to note that this piece of code is relatively short and simple. Aren't you optimising a bit too soon? Side note: be sure to turn on proper optimisations (speed/memory) in the compiler. If you happen to be measuring a "debug" build, you can dismiss your speed/memory observations immediately.
A lot of OpenCV functions are defined as
function(InputArray src, OutputArray dst, otherargs...)
So if I want to process and overwrite the same image, can I do this:
function(myImg, myImg);
Is it safe to do it this way?
Thanks
Edit:
I'm asking about the standard functions in OpenCV, like threshold, blur, etc. I think they should have been implemented to handle this, right?
Yes, in OpenCV it is safe.
Internally, a function like:
void somefunction(InputArray _src, OutputArray _dst);
will do something like:
Mat src = _src.getMat();
_dst.create( src.size(), src.type() );
Mat dst = _dst.getMat();
// dst filled with values
So, if src and dst are:
the same image, create won't actually do anything, and the modifications are effectively in-place. Some functions may clone the src image internally if the operation cannot be performed in-place, to guarantee correct behavior (e.g. findContours in OpenCV > 3.2).
different images, create will create a new matrix dst without modifying src.
Documentation states where this default behavior doesn't hold.
A notable example is findContours, which modifies the src matrix. You usually cope with this by passing src.clone() as input, so that only the cloned matrix is modified, not the one you cloned it from.
From OpenCV 3.2, findContours doesn't modify the input image.
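To make the general "it is safe" rule concrete, a minimal example of the in-place pattern with standard functions:

// standard functions such as blur and threshold are safe to call in-place:
Mat img = imread("input.png", IMREAD_GRAYSCALE);
GaussianBlur(img, img, Size(5, 5), 0);        // src and dst are the same Mat
threshold(img, img, 128, 255, THRESH_BINARY); // also safe in-place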
Thanks to Fernando Bertoldi for reviewing the answer
EDIT: Now that the question was updated, I realize this is rather irrelevant. I'll leave it here, however, in case someone searching for a related issue comes along.
In general with C++, whether that situation is safe really depends on the body of the function in question. If you are reading from and writing to the same variable directly, you could wind up with some serious logic issues.
However, if you are using a temporary variable to hold the original value before overwriting it, it should be fine.
A MASSIVE word of warning if you're working with arrays, however. If you are trying to store the entire contents of the array in a temporary variable, you have to be careful you're storing the actual array, and not just the pointer to it. In many situations, it would generally be advisable to store individual values in the array temporarily (such as in a swap function). I can't give much further advice in this regard, however, as it all depends on what you're trying to do.
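For example (arr, i and j here are hypothetical):

int *tmp = arr;      // copies only the pointer: later writes to arr are visible through tmp too
int saved = arr[i];  // copies an actual value, which is what a swap needs
arr[i] = arr[j];
arr[j] = saved;      // classic swap using the saved element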
In short, it all depends on your function's implementation.
I'm using the Armadillo library in C++ for storing / calculating large matrices. It is my understanding that one should store large arrays / matrices dynamically (on the heap).
Suppose I declare a matrix
mat X;
and set the size to be (say) 500 rows, 500 columns with random entries:
X.randn(500,500);
Does Armadillo store X dynamically (i.e. on the heap) despite my not using new or delete? The reason I ask is that it seems Armadillo allows me to declare a variable as:
mat::fixed<n_rows, n_cols>
which, I quote: "is generally faster than dynamic memory allocation, but the size of the matrix can't be changed afterwards (directly or indirectly)".
Regardless of the above -- should I use this:
mat A;
A.set_size(n-1,n-1);
or this:
mat *A = new mat;
(*A).set_size(n-1,n-1);
where n is between 1000 and 100000 and is not known in advance.
Does Armadillo store X dynamically (i.e. on the heap) despite not using new or delete?
Yes. There will be some form of new or delete in the library code. You just don't notice it from the outside.
The reason I ask is that it seems Armadillo allows me to declare a variable as (mat::fixed ...)
You'd have to look into the source code to see what's going on exactly here. My guess is that it has some kind of internal logic that decides how to deal with things based on size. You would normally use mat::fixed for small matrices, though.
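For what it's worth, the usage contrast looks like this (a sketch based on Armadillo's documented API):

#include <armadillo>

void example(arma::uword n) {
    arma::mat::fixed<3, 3> R;  // size fixed at compile time; the storage lives
                               // inside the object itself, no separate heap block
    arma::mat A(n - 1, n - 1); // size chosen at runtime; the element buffer is
                               // heap-allocated internally by the library
    R.zeros();
    A.randn();
}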
Following that, you should use
mat A(n-1,n-1);
if you know the size at that point already. In some cases,
mat A;
A.set_size(n-1,n-1);
might also be okay.
I can't think of a good reason to use your second option with the mat * pointer. First of all, libraries like Armadillo handle their memory allocations internally, and the developers take great care to get this right. Also, even if the memory code in the library were broken, your new mat idea wouldn't fix it: you would allocate memory for the mat object itself, but that object is certainly rather small. The big part is probably hidden behind something like a member variable T* data in the mat class, and you cannot influence how that is allocated from the outside.
I initially missed your comment on the size of n. As Mikhail says, dealing with 100000x100000 matrices will require much more care than simply thinking about the way you instantiate them.
I have a question related to the implementation of image interpolation (bicubic and bilinear methods) with C++. My main concern is speed. Based on my understanding of the problem, in order to make the interpolation program fast and efficient, the following strategies can be adopted:
Fast image interpolation using Streaming SIMD Extensions (SSE)
Image interpolation with multi-threading or GPU
Fast image interpolation algorithms
C++ implementation tricks
Here, I am more interested in the last strategy. I set up a class for interpolation:
/**
 * This class is used to perform interpolation for a certain point in
 * the image grid.
 */
class Sampling
{
public:
    // samples[0] *-------------* samples[1]
    //             -------------
    //             -------------
    // samples[2] *-------------* samples[3]
    inline void sampling_linear(unsigned char *samples, unsigned char &res)
    {
        unsigned char res_temp[2];
        sampling_linear_1D(samples, res_temp[0]);     // interpolate along the top edge
        sampling_linear_1D(samples + 2, res_temp[1]); // interpolate along the bottom edge
        sampling_linear_1D(res_temp, res);            // combine the two results vertically
    }

private:
    // body left empty in the question; shown here as the midpoint case for
    // illustration (a full version would take the fractional offset as a weight)
    inline void sampling_linear_1D(unsigned char *samples, unsigned char &res)
    {
        res = (unsigned char)((samples[0] + samples[1] + 1) >> 1);
    }
};
Here I only give an example for bilinear interpolation. To make the program run faster, inline functions are employed. My question is whether this implementation scheme is efficient. Additionally, suppose that during the interpolation procedure I give the user the option of choosing between different interpolation methods. Then I have two choices:
Depending on the interpolation method, invoke a function that performs the interpolation for the whole image.
For each output image pixel, first determine its position in the input image, and then, according to the interpolation method setting, pick the interpolation function.
The first method means more code in the program, while the second one may lead to inefficiency. How should I choose between these two schemes? Thanks!
Fast image interpolation using Streaming SIMD Extensions (SSE)
This may not provide the desired result, because I expect that your algorithm will be memory-bound rather than FLOP/s-bound.
I mean, it will definitely be an improvement, but not one that justifies the implementation cost.
And by the way, modern compilers can perform auto-vectorization (i.e. use SSE and further extensions): GCC starting from 4.0, MSVC starting from 2012; see the MSVC auto-vectorization video lectures.
Image interpolation with multi-threading or GPU
A multi-threaded version should give a good speedup, because it would allow you to exploit all the available memory throughput.
If you do not plan to process the data several times, or use it in some other way on the GPU, then GPGPU may not give the desired result. Yes, the GPU will produce the result faster (mostly due to higher memory speed), but this effect will be cancelled out by the slow transfer between main RAM and the GPU's RAM.
Just for example, approximate modern throughputs:
CPU RAM: ~20 GiB/s
GPU RAM: ~150 GiB/s
Transfer between CPU RAM and GPU RAM: ~3-5 GiB/s
For single-pass memory-bound algorithms, the third item makes GPUs impractical in most cases: the image has to cross the 3-5 GiB/s bus twice (upload and download) before the 150 GiB/s GPU memory can pay off.
In order to make the program run faster, the inline function is employed
Class member functions defined inside the class body are implicitly inline. Be aware that the main purpose of inline is not actually inlining, but helping to prevent One Definition Rule violations when your functions are defined in headers.
There are compiler-dependent "force inline" features; for instance, MSVC has __forceinline, and Boost abstracts the compiler-specific variants behind its ifdef'ed BOOST_FORCEINLINE macro.
Anyway, trust your compiler unless you can prove otherwise (with the help of the generated assembly, for example). The most important point is that the compiler should see the function definitions; then it can decide to inline by itself, even if the function is not marked inline.
My question is whether this implementation scheme is efficient.
As I understand it, as a pre-step you gather the samples into a 2x2 matrix. It may be better to pass two pointers to two-element arrays within the image directly, or one pointer plus the row width (to calculate the second pointer automatically). However, it is not a big issue; most likely your temporary 2x2 matrix will be optimized away.
What is really important - is how you traverse your image.
Let's say that for given x and y, the index is calculated as:
i=width*y+x;
Then your traversal loop should be:
for(int y=/*...*/)
for(int x=/*...*/)
{
// loop body
}
Because if you chose the other order (x in the outer loop, y in the inner), the traversal would not be cache-friendly, and the resulting performance drop can be up to 64x (depending on your pixel size).
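You may check it yourself just for interest; a tiny self-contained benchmark along these lines would show it (illustrative, and the exact numbers depend on the machine):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int w = 4096, h = 4096;
    std::vector<unsigned char> img((size_t)w * h, 1);
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int y = 0; y < h; ++y)     // y outer, x inner: sequential access
        for (int x = 0; x < w; ++x)
            sum += img[(size_t)w * y + x];
    auto t1 = std::chrono::steady_clock::now();
    for (int x = 0; x < w; ++x)     // x outer, y inner: strided, cache-hostile
        for (int y = 0; y < h; ++y)
            sum += img[(size_t)w * y + x];
    auto t2 = std::chrono::steady_clock::now();

    std::printf("row-major %lld us, column-major %lld us (sum=%lld)\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
        sum);
    return 0;
}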
The first method means more code in the program, while the second one may lead to inefficiency. How should I choose between these two schemes? Thanks!
In this case, you can use compile-time polymorphism to reduce the amount of code in the first version, for instance based on templates.
Just look at std::accumulate: it is written once, yet works with different types of iterators and different binary operations (functions or functors), without implying any runtime penalty for its polymorphism.
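A sketch of that idea applied here (names are illustrative, not from the question's code):

// the interpolation method is a template parameter: the driver loop is
// written once, and each instantiation is statically dispatched and inlinable
struct Bilinear {
    static unsigned char sample(const unsigned char *p, int stride) {
        // midpoint case shown for brevity; real code would take weights
        return (unsigned char)((p[0] + p[1] + p[stride] + p[stride + 1] + 2) >> 2);
    }
};

template <typename Method>
void resample(const unsigned char *src, int stride,
              unsigned char *dst, int count) {
    for (int i = 0; i < count; ++i)
        dst[i] = Method::sample(src + i, stride); // no per-pixel branch
}

// usage: resample<Bilinear>(src, width, dst, n);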
Alexander Stepanov says:
For many years, I tried to achieve relative efficiency in more advanced languages (e.g., Ada and Scheme) but failed. My generic versions of even simple algorithms were not able to compete with built-in primitives. But in C++ I was finally able to not only accomplish relative efficiency but come very close to the more ambitious goal of absolute efficiency. To verify this, I spent countless hours looking at the assembly code generated by different compilers on different architectures.
Check Boost's Generic Image Library (GIL): it has a good tutorial, and there is a video presentation from the author.