Question about memory consumption in OpenCV - C++

Cross post here
I have such simple code:
// OpenCV 3.3.1 project
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace std;
using namespace cv;

Mat removeBlackEdge(Mat img) {
    if (img.channels() != 1)
        cvtColor(img, img, COLOR_BGR2GRAY);
    Mat bin, whiteEdge;
    threshold(img, bin, 0, 255, THRESH_BINARY_INV + THRESH_OTSU);
    rectangle(bin, Rect(0, 0, bin.cols, bin.rows), Scalar(255));
    whiteEdge = bin.clone();
    floodFill(bin, Point(0, 0), Scalar(0));
    dilate(whiteEdge - bin, whiteEdge, Mat());
    return whiteEdge + img;
}

int main() {                                  // 13.0 MB
    Mat emptyImg = imread("test.png", 0);     // 14.7 MB
    emptyImg = removeBlackEdge(emptyImg);     // 33.0 MB
    waitKey();
    return 0;
}
And this is my test.png.
As Windows shows, its size on disk is about 1.14 MB. Of course I know it will be larger once it is decoded into memory, but I still cannot accept the huge memory consumption.
If I step through with F10, I can see that project.exe takes up 13.0 MB at the start of main. I have nothing to say about that.
When I run Mat emptyImg = imread("test.png", 0);, memory consumption is 14.7 MB. That is also normal.
But why, when I run emptyImg = removeBlackEdge(emptyImg);, does memory consumption jump to 33.0 MB? That means the function removeBlackEdge costs an extra 18.3 MB. Can anybody tell me something? I cannot accept such huge memory consumption. Am I missing something? I assign the new emptyImg over the old emptyImg, so I thought I would not need extra memory.
PS: If I want to do the same thing my removeBlackEdge does (remove the black edge from the image), how should I adjust it to use as little memory as possible?

Below: guesswork and hints. This is really a large comment. Please don't mark it as the answer even if it was helpful in finding/solving the problem. Rather, if you manage to find the cause, write your own answer instead. I'll probably remove this one later.
Most probably, image/bitmap operations like threshold(img, bin, ..) create copies of the original image. It's hard to say how many, but at least one is surely created, as you can see from the second bin variable. Then come clone and whiteEdge, so a third copy. Operations like whiteEdge - bin probably create at least one copy (the result) as well. rectangle, dilate and floodFill probably work in place, using one of the Mat variables passed in as both work area and output.
That means you have at least 4 copies of the image, probably a few more, but let's say it's 4 * 1.7 MB, so ~7 MB of the ~18 MB increase you observed.
So we've got roughly 11 MB left to explain, or so it seems.
First of all, it's possible that the image operations allocate some extra buffers and keep them for later reuse. That costs memory, but it's "amortized", so it won't cost you again if you run these operations again.
It's also possible that calling those image operations for the first time caused some dynamic libraries to be loaded. That's also a one-time cost that won't recur on later calls.
The next thing to note is that the memory usage of a process as reported by Windows in simple tools is not accurate. As the process allocates and releases memory, the consumption reported by Windows usually only increases, and does not decrease upon release, at least not immediately. The process may keep some "air" or "vacuum". There are many causes of this, but the point is that this "air" is often reusable when the program tries to allocate memory again. A process with a base memory usage of 30 MB that periodically allocates and releases 20 MB may be displayed by Windows as always taking up 50 MB.
Having said that, I have to say I actually doubt your memory measurement methodology, or at least the conclusions drawn from the observations. After a single run of your removeBlackEdge you cannot say it consumed that amount of memory. What you observed is only that the process's working memory pool grew by that much. That by itself does not tell you much.
If you suspect that temporary copies of the image are taking up too much space, try getting rid of them. That may mean things as obvious as writing the code to use fewer Mat variables and temporaries, or reusing and deallocating bitmaps as soon as they are no longer needed instead of letting them wait until the function ends, or less obvious things like choosing different algorithms or writing your own. You can also try to confirm whether that's the case by running the program several times with input images of different sizes and plotting observed memory versus input size. If it grows linearly (i.e. "memory consumption" is always ~10x the input size), then it probably really is copies.
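As a minimal sketch of that first idea (assuming the same includes and using-directives as your code), the function can reuse its buffers and release them as soon as possible. It is meant to behave like your removeBlackEdge, but OpenCV may still allocate internal work buffers, so the saving is not guaranteed:

Mat removeBlackEdgeLowMem(Mat img) {
    if (img.channels() != 1)
        cvtColor(img, img, COLOR_BGR2GRAY);

    Mat bin;
    threshold(img, bin, 0, 255, THRESH_BINARY_INV + THRESH_OTSU);
    rectangle(bin, Rect(0, 0, bin.cols, bin.rows), Scalar(255));

    Mat whiteEdge = bin.clone();           // one copy we cannot easily avoid
    floodFill(bin, Point(0, 0), Scalar(0));

    whiteEdge -= bin;                      // in place, no extra temporary
    bin.release();                         // hand this buffer back early
    dilate(whiteEdge, whiteEdge, Mat());   // dilate supports in-place operation

    whiteEdge += img;                      // in place instead of whiteEdge + img
    return whiteEdge;
}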
On the other hand, if you suspect a memory leak, run that removeBlackEdge several times. Hundreds, thousands or millions of times. Observe whether the "memory consumption" steadily grows over time or not. If it does, there is probably a leak. If it grows only once at the start and then holds steady at the same level, then there's probably nothing wrong and it was only some one-time initialization or amortized caches/workspaces/etc.
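A driver as simple as this would do for that long-run test (again assuming the includes, usings and removeBlackEdge from the question):

int main() {
    for (int i = 0; i < 1000; i++) {
        Mat img = imread("test.png", 0);
        removeBlackEdge(img);   // result is discarded and freed at the end of each pass
    }
    return 0;
}

Watch the process in Task Manager (or better, Process Explorer) while it runs and see whether the number keeps climbing or levels off.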
I'd suggest you do those two tests now. Further work and analysis depends on what you observe during such a long run. Also, I have to note that this piece of code is relatively short and simple; aren't you optimising a bit too soon? Side note: be sure to turn on proper optimisations (speed/memory) in the compiler. If you happen to be running and observing a "debug" build, you can dismiss your speed/memory observations immediately.

Related

Quick'n'Dirty: Measure memory usage?

There are plenty of detailed instructions for measuring C++ memory usage (example links at the bottom).
If I've got a program processing and displaying pixel data, I can use Windows Task Manager to spot memory leakage: if processing/displaying/closing multiple files in succession means the executable's Memory (Private working set) grows with each iteration, something is leaking. Yes, it's not perfect, but when processing thousands of frames of data this works as a quick'n'dirty check.
To chase down a memory bug in that (large) project, I wrote a program to measure memory usage accurately, using Lanzelot's useful answer, namely the part titled "Total Physical Memory (RAM)". But if I calloc the size of one double, I get 577536. Even if that were quoted in bits, it would be a lot.
I tried writing a bog-standard program that pauses, assigns some memory (say, calloc of a megabyte worth of data) and pauses again before free'ing said memory. The pauses are long enough to let me look at WTM comfortably. Except the executable only grows by 4 KB(!) per memory assignment.
What am I missing here? Is Qt Creator or the compiler optimising away the assigned memory? Why does the big, complex project seemingly allow memory-usage resolution down to ~1 MB, while whatever memory size I fruitlessly fiddle with in my simple program barely moves the number shown in Windows Task Manager at all?
C++: Measuring memory usage from within the program, Windows and Linux
https://stackoverflow.com/a/64166/2903608
--- Edit ---
Example code, as simple as:
#include <QDebug>
#include <cstdlib>

double *pDouble = (double*) calloc(1, sizeof(double));
*pDouble = 5.0;
qDebug() << "*pDouble: " << *pDouble;
If I look in WTM, this takes 4 KB (whether for 1 or for 1,000,000 doubles). With Lanzelot's solution, north of 500k.
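The linked answer also covers per-process counters, which are probably what you want here rather than the machine-wide "Total Physical Memory" figures: the process counters roughly correspond to what Task Manager reports, and they move with page (4 KB) granularity. A rough sketch along those lines (Windows only, link against Psapi.lib; the Win32 names are real, the rest is illustrative):

#include <windows.h>
#include <psapi.h>
#include <cstdio>
#include <cstdlib>

// Returns this process's committed private bytes (close to Task Manager's
// "Commit size"); pmc.WorkingSetSize would correspond to the working set instead.
static size_t privateBytes() {
    PROCESS_MEMORY_COUNTERS_EX pmc;
    GetProcessMemoryInfo(GetCurrentProcess(),
                         (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc));
    return pmc.PrivateUsage;
}

int main() {
    size_t before = privateBytes();
    // ~8 MB: large enough to show up above page granularity and CRT heap caching
    double* p = (double*) calloc(1 << 20, sizeof(double));
    p[0] = 5.0;
    size_t after = privateBytes();
    printf("delta: %zu bytes\n", after - before);
    free(p);
    return 0;
}

A single calloc'd double, by contrast, is usually served from a page the CRT heap has already committed, so these counters (and Task Manager) typically will not move for it.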

C++ speed up method call

I am working on a very time-consuming application and I want to speed it up a little. I analysed the runtime of individual parts using the clock() function of the ctime library and found something that is not entirely clear to me.
I have time prints outside and inside of a method, let's call it Method1. The print inside Method1 covers its whole body; only the return of a float is excluded, of course. Well, the thing is that the print outside reports two to three times the time of the print inside Method1. It's obvious that the outside print should report more time, but the difference seems quite big to me.
My method looks as follows. I am using references and pointers as parameters to prevent copying of data. Note that the data vector contains 330,000 pointers to instances.
float ClassA::Method1(vector<DataClass*>& data, TreeClass* node)
{
    //start time measurement
    vector<Mat> offset_vec_1 = vector<Mat>();
    vector<Mat> offset_vec_2 = vector<Mat>();
    for (int i = 0; i < data.size(); i++)
    {
        DataClass* cur_data = data.at(i);
        Mat offset1 = Mat();
        Mat offset2 = Mat();
        getChildParentOffsets(cur_data, node, offset1, offset2);
        offset_vec_1.push_back(offset1);
        offset_vec_2.push_back(offset2);
    }
    float ret = CalculateCovarReturnTrace(offset_vec_1) + CalculateCovarReturnTrace(offset_vec_2);
    //end time measurement
    return ret;
}
Is there any "obvious" way to increase the call speed? I would prefer to keep the method for readability reasons; is there anything I can change to gain a speed-up?
I appreciate any suggestions!
Based on your updated code, the only thing between the end-time measurement inside the function and the measurement after the function call is the destruction of the objects constructed in the function: the two vectors of 330,000 Mats each. That will likely take some time, depending on the resources used by each of those Mats.
Without trying to lay claim to any of the comments made by others to the OP ...
(1) The short answer might well be "no." This function appears to be quite clear, and it's doing a lot of work 330,000 times. Then it does a calculation over "all that data."
(2) Consider re-using the "offset1" and "offset2" matrices instead of creating entirely new ones in each iteration. It remains to be seen, of course, whether this would actually be faster. (And in any case, see below, it amounts to "diddling the code." A sketch of a related variant follows after this answer.)
(3) Therefore, borrowing from The Elements of Programming Style: "Don't 'diddle' code to make it faster: find a better algorithm." And in this case there just might not be one. You might need to address the runtime issue by "throwing silicon at it," and I'd suggest that the first thing to do would be to add as much RAM as possible to this computer. A process that "deals with a lot of data" is very exposed to virtual-memory page faults, each of which takes on the order of milliseconds to resolve. (Those thousandths of a second add up real fast.)
I personally do not see anything categorically wrong with this code, nor anything that would categorically make it run faster. Nor would I advocate re-writing ("diddling") the code away from the very clear expression of it that you have right now.
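If you do decide to diddle, here is a minimal sketch of the direction in (2), with one twist: rather than literally reusing two fixed matrices (which would make every stored entry share the same buffer unless getChildParentOffsets reallocates them), it reserves both vectors up front and moves the Mat headers in. The class and helper names are taken from the question; this is a sketch, not a measured optimisation:

float ClassA::Method1(vector<DataClass*>& data, TreeClass* node)
{
    vector<Mat> offset_vec_1;
    vector<Mat> offset_vec_2;
    offset_vec_1.reserve(data.size());   // one allocation up front instead of repeated growth
    offset_vec_2.reserve(data.size());

    for (size_t i = 0; i < data.size(); i++)
    {
        Mat offset1, offset2;
        getChildParentOffsets(data[i], node, offset1, offset2);
        offset_vec_1.push_back(std::move(offset1));   // moves the Mat header, no refcount churn
        offset_vec_2.push_back(std::move(offset2));
    }

    return CalculateCovarReturnTrace(offset_vec_1) + CalculateCovarReturnTrace(offset_vec_2);
}

Note that the destruction cost pointed out above (two vectors of 330,000 Mats) is not changed by this.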

Why Copying an instance of object in a loop takes a huge memory in C++?

I have written a program that works iteratively to find a solution. I initially used vectors to hold instances of an object and it worked fine, but then I preferred to have one instance of the class as the primary object plus a temporary object that is created inside a while loop through a kind of instance copying. It works, but more slowly, and it also occupies almost twice as much RAM (e.g. about 980 MB before the change and about 1.6 GB after). Why? I really have no idea. I took the "copying" line (which is not technically a copy constructor, but works the same way) out of the loop and it behaves as expected, with the expected RAM usage, so the problem arises when the "copying line" is inside the loop. Any idea why this happens?
A simplified preview of the code:
void SPSA::beginIteration(Model ins, Inventory inv_Data, vector<Result>& res)
{
    bool icontinue = true;
    while (icontinue)
    {
        Model ins_temp(&ins, &inv_Data);
        if (model_counter > 0)
            ins_temp.setDecesionVariableIntoModel(decisionVariable);

        //something useful here

        model_counter++;
    }
}
The code above occupies a lot of RAM, but the code below is fine:
void SPSA::beginIteration(Model ins, Inventory inv_Data, vector<Result>& res)
{
    bool icontinue = true;
    Model ins_temp(&ins, &inv_Data);
    while (icontinue)
    {
        if (model_counter > 0)
            ins_temp.setDecesionVariableIntoModel(decisionVariable);

        //something useful here

        model_counter++;
    }
}
By the way, I'm compiling with mingw++.
Thanks
The main difference is copying the Model N times (once for each loop body execution, depending on when icontinue is set) rather than once.
First, try to reduce the problem:
while(1) Model ins_temp(&ins, &inv_Data);
If that also eats a significant amount of memory, then it's a problem within Model.
(The above loop may eat a little memory due to fragmentation, depending on how Model is implemented.)
Most likely cause (without additional information) is a memory leak in Model.
Other possibilities include: Model uses a lazy memory release (such as / similar to a garbage collector), Model uses shared pointers and creating more than one instance causes circular references, or you are running into a very ugly, very bad memory fragmentation problem (extremely unlikely).
How to proceed
The "pro" solution would be using a memory profiler (options for mingw).
Alternatively, study your Model code for leaks, or cut down the implementation of Model until you find the minimal change that makes the leak go away.
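For illustration only (none of these names or sizes come from your code), the kind of minimal defect such a reduction typically uncovers looks like this:

struct Inventory { };   // stand-in type for this example

class Model {
public:
    Model(const Model* /*other*/, const Inventory* /*inv*/)
        : buffer_(new double[1 << 20])   // ~8 MB allocated per construction
    { }
    ~Model() { }                         // BUG: buffer_ is never deleted, so every
                                         // pass of the while loop leaks ~8 MB
private:
    double* buffer_;
};

The fix would be to delete[] buffer_ in the destructor, or better, to hold the data in a std::vector<double> or std::unique_ptr<double[]> so it releases itself.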
Don't be scared immediately if your code appears to use a lot of memory; it doesn't necessarily mean there's a memory leak. C++ does not always return memory to the OS, but may keep it on hand for future allocations. It only counts as a memory leak if the program loses track of the allocation and can no longer use that memory for future allocations.
The fact that the memory roughly doubles regardless of the iteration count suggests that the memory is recycled not at the very first opportunity, but on every second allocation. Presumably you end up with two big allocations that are used alternately.
It's even possible that if the second allocation had failed, the first block would have been recycled, so you use 1.6 GB of memory simply because it's there.

Is this normal behavior for a std::vector?

I have a std::vector of a class called OGLSHAPE.
Each shape has a vector of SHAPECONTOUR structs, each of which has a vector of float and a vector of vector of double. Each shape also has a vector of an outline struct which has a vector of float in it.
Initially, my program starts up using 8.7 MB of RAM. I noticed that when I started filling these up, e.g. adding doubles and floats, the memory got fairly high quickly and then levelled off. When I clear the OGLSHAPE vector, about 19 MB is still used. Then if I push about 150 more shapes and clear those, I'm now using around 19.3 MB of RAM. I would have thought that, logically, if it went from 8.7 to 19 the first time, the next time it would go up to around 30. I'm not sure what is going on. I thought it was a memory leak, but now I'm not sure. All I do is push numbers into std::vectors, nothing else. So I'd expect to get all my memory back. What could cause this?
Thanks
Edit: okay, it's memory fragmentation from allocating lots of small things; how can that be solved?
Calling std::vector<>::clear() does not necessarily free all of the allocated memory (it depends on the implementation of std::vector<>). This is often done as an optimisation, to avoid unnecessary memory allocations.
In order to really free the memory held by an instance, just do:
template <typename T>
inline void really_free_all_memory(std::vector<T>& to_clear)
{
    std::vector<T> v;
    v.swap(to_clear);
}

// ...
std::vector<foo> objs;
// ...
// really free instance 'objs'
really_free_all_memory(objs);
which creates a new (empty) instance and swaps it with the vector instance you would like to clear.
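As a side note, since C++11 the same effect can be approximated without the helper; only the swap with an empty temporary is guaranteed to drop the old buffer, since shrink_to_fit is a non-binding request:

objs.clear();
objs.shrink_to_fit();           // may release the capacity (non-binding request)
std::vector<foo>().swap(objs);  // guaranteed: old buffer is destroyed with the temporary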
Use the correct tools to observe your memory usage, e.g. (on Windows) use Process Explorer and observe Private Bytes. Don't look at Virtual Address Space since that shows the highest memory address in use. Fragmentation is the cause of a big difference between both values.
Also realize that there are a lot of layers in between your application and the operating system:
the std::vector does not necessarily free all memory immediately (see tip of hkaiser)
the C Run Time does not always return all memory to the operating system
the Operating System's heap routines may not be able to free all memory because they can only free whole pages (of 4 KB). If one byte of a 4 KB page is still used, the page cannot be freed.
There are a few possible things at play here.
First, the way memory works in most common C and C++ runtime libraries is that, once it has been allocated to the application from the operating system, it is rarely given back to the OS. When you free it in your program, the memory manager keeps it around in case you ask for more memory again. If you do, it hands it back to you for re-use.
The other reason is that vectors themselves typically don't reduce their size, even if you clear() them. They keep the "capacity" that they had at their highest so that it is faster to re-fill them. But if the vector is ever destroyed, that memory will then go back to the runtime library to be allocated again.
So, if you are not destroying your vectors, they may be keeping the memory internally for you. If you are using something in the operating system to view memory usage, it is probably not aware of how much "free" memory is waiting around in the runtime libraries to be used, rather than being given back to the operating system.
The reason your memory usage increases slightly (instead of not at all) is probably because of fragmentation. This is a sort of complicated tangent, but suffice it to say that allocating a lot of small objects can make it harder for the runtime library to find a big chunk when it needs it. In that case, it can't reuse some of the memory it has laying around that you already freed, because it is in lots of small pieces. So it has to go to the OS and request a big piece.

Multithreading not taking advantage of multiple cores?

My computer is a dual-core Core 2 Duo. I have implemented multithreading in a slow area of my application, but I still notice that CPU usage never exceeds 50% and it still lags after many iterations. Is this normal? I was hoping it would get my CPU up to 100%, since I'm dividing the work into 4 threads. Why could it still be capped at 50%?
Thanks
See What am I doing wrong? (multithreading) for my implementation, except that I fixed the issue that code was having.
Looking at your code, you are making a huge number of allocations in your tight loop: in each iteration you dynamically allocate two two-element vectors and then push those back onto the result vector (thus making copies of both of those vectors); that last push_back will occasionally cause a reallocation and a copy of the vector's contents.
Heap allocation is relatively slow, even if your implementation uses a fast, fixed-size allocator for small blocks. In the worst case, the general-purpose allocator may even use a global lock; if so, it will obliterate any gains you might get from multithreading, since each thread will spend a lot of time waiting on heap allocation.
Of course, profiling would tell you whether heap allocation is constraining your performance or whether it's something else. I'd make two concrete suggestions to cut back your heap allocations:
Since every instance of the inner vector has two elements, you should consider using a std::array (or std::tr1::array or boost::array); the array "container" doesn't use heap allocation for its elements (they are stored like a C array).
Since you know roughly how many elements you are going to put into the result vector, you can reserve() sufficient space for those elements before inserting them.
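A hedged sketch of both suggestions together, with invented names and element values since the code from the linked question isn't reproduced here:

#include <array>
#include <vector>

// Each result row holds exactly two values, so std::array keeps them inline
// (no heap allocation per element), and reserve() makes one allocation up front.
void fillResults(std::vector<std::array<double, 2>>& results, std::size_t expected)
{
    results.reserve(expected);
    for (std::size_t i = 0; i < expected; ++i)
    {
        std::array<double, 2> row = { double(i), double(i) * 2.0 };  // placeholder values
        results.push_back(row);
    }
}

With the per-iteration heap traffic gone, the threads at least stop contending on the allocator, which is one possible cause of the 50% cap.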
From your description we have very little to go on, however, let me see if I can help:
You have implemented a lock-based system but you aren't judiciously using the resources of the second, third, or fourth threads because the entity that they require is constantly locked. (this is a very real and obvious area I'd look into first)
You're not actually using more than a single thread. Somehow, somewhere, those other threads aren't even fired up or initialized. (sounds stupid but I've done this before)
Look into those areas first.