How does a memory leak improve performance - c++

I'm building a large RTree (spatial index) full of nodes. It needs to be able to handle many queries AND updates. Objects are continuously being created and destroyed. The basic test I'm running is to see the performance of the tree as the number of objects in the tree increases. I insert from 100-20000 uniformly size, randomly located objects in increments of 100. Searching and updating are irrelevant to the issue I am currently faced with.
Now, when there is NO memory leak the "insert into tree" performance is everywhere. It goes anywhere from 10.5 seconds with ~15000 objects to 1.5 with ~18000. There is no pattern whatsoever.
When I deliberately add in a leak, as simple as putting in "new int;" I don't assign it to anything, that right there is a line to itself, the performance instantly falls onto a nice gentle curve sloping from 0 (roughly) seconds for 100 objects to 1.5 for the full 20k.
Very, very lost at this point. If you want source code I can include it but it's huuugggeeee and literally the only line that makes a difference is "new int;"
Thanks in advance!
-nick

I'm not sure how you came up with this new int test, but it's not very good way to fix things :) Run your code using a profiler and find out where the real delays are. Then concentrate on fixing the hot spots.
g++ has it built in - just compile with -pg

Without more information it's impossible to be sure.
However I wonder if this is to do with heap fragmentation. By creating a freeing many blocks of memory you'll likely be creating a whole load of small fragments of memory linked together.The memory manager needs to keep track of them all so it can allocate them again if needed.
Some memory managers when you free a block try to "merge" it with surrounding blocks of memory and on a highly fragmented heap this can be very slow as it tries to find the surrounding blocks. Not only this, but if you have limited physical memory it can "touch " many physical pages of memory as it follows the chain of memory blocks which can cause a whole load of extremely slow page faults which will be very variable in speed depending on exactly how much physical memory the OS decides to give that process.
By leaving some un-freed memory you will be changing this pattern of access which might make a large difference to the speed. You might for example be forcing the run time library to allocate new block of memory each time rather than having to track down a suitably sized existing block to reuse.
I have no evidence this is the case in your program, but I do know that memory fragmentation is often the causes of slow programs when a lot of new and free is performed.

The possible thing that is happening which explains this (a theory)
The compiler did not remove the empty new int
The new int is in one of the inner loops or somewhere in your recursive traversal wherein it gets executed the most amount of time
The overall RSS of the process increases and eventually the total memory being used by the process
There are page faults happening because of this
Because of the page-faults, the process becomes I/O bound instead of being CPU bound
End result, you see a drop in the throughput. It will help if you can mention the compiler being used and the options for the compiler that you are using to build the code.

I am taking a stab in the dark here but the problem could be the way the heap gets fragmented. You said that you are creating a destroying large numbers of objects. I will assume that the objects are all of different size.
When one allocates memory on the heap, a cell the size needed is broken off from the heap. When the memory is freed, the cell is added to a freelist. When one does a new alloc, the allocator walks the heap until a cell that is big enough is found. When doing large numbers of allocations, the free list can get rather long and walking the list can take a non-trivial amount of time.
Now an int is rather small. So when you do your new int, it may well eat up all the small heap cells on the free list and thus dramatically speed up larger allocations.
The chances are, however that you are allocating and freeing similar sized objects. If you use your own freelists, you will safe yourself many heap walks and may dramatically improve performance. This is exactly what the STL allocators do to improve performance.

Solution: Do not run from Visual Studio. Actually run the .exe file. Figured this out because that's what the profilers were doing and the numbers were magically dropping. Checked memory usage and version running (and giving me EXCEPTIONAL times) was not blowing up to excessively huge sizes.
Solution to why the hell Visual Studio does ridiculous crap like this: No clue.

Related

Allocating ~10GB of vectors - how can I speed it up?

I'm loading about ~1000 files, each representing an array of ~3 million floats. I need to have them all in memory together as I need to do some calculations that involve all of them.
In the code below, I've broken out the memory allocation and file reading, so I can observe the speed of each separately. I was a bit surprised to find the memory allocation taking much longer than the file reading.
std::vector<std::vector<float> * > v(matrix_count);
for(int i=0; i < matrix_count; i++) {
v[i] = new std::vector<float>(array_size);
}
for(int i=0; i < matrix_count; i++) {
std::ifstream is(files[i]);
is.read((char*) &((*v[i])[0]), size);
is.close();
}
Measuring the time, the allocating loop took 6.8s while file loading took 2.5s. It seems counter-intuitive that reading from the disk is almost 3x faster than just allocating space for it.
Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_malloc -- I guess a 10GB vector isn't ok.
Is there something I could do to speed up the memory allocation? I tried allocating one large vector instead, but that failed with bad_malloc -- I guess a 10GB vector isn't ok.
I mainly wanted to respond by addressing this one part: bad_alloc exceptions tend to be misunderstood. They're not the result of "running out of memory" -- they're the result of the system failing to find a contiguous block of unused pages. You could have plenty more than enough memory available and still get a bad_alloc if you get in the habit of trying to allocate massive blocks of contiguous memory, simply because the system can't find a contiguous set of pages that are free. You can't necessarily avoid bad_alloc by "making sure plenty of memory is free" as you might have already seen where having over 100 gigabytes of RAM can still make you vulnerable to them when trying to allocate a mere 10 GB block. The way to avoid them is to allocate memory in smaller chunks instead of one epic array. At a large enough scale, structures like unrolled lists can start to offer favorable performance over a gigantic array and a much lower (exponentially) probability of ever getting a bad_alloc exception unless you actually do exhaust all the memory available. There is actually a peak where contiguity and the locality of reference it provides ceases to become beneficial and may actually hinder memory performance at a large enough size (mainly due to paging, not caching).
For the kind of epic scale input you're handling, you might actually get better performance out of std::deque given the page-friendly nature of it (it's one of the few times where deque can really shine without need for push_front vs. vector). It's something to potentially try if you don't need perfect contiguity.
Naturally it's best if you measure this with an actual profiler. It'll help us hone in on the actual problem, though it might not be completely shocking (surprising but maybe not shocking) that you might be bottlenecked by memory here instead of disk IO given the kind of "massive number of massive blocks" you're allocating (disk IO is slow but memory heap allocation can sometimes be expensive if you are really stressing the system). It depends a lot on the system's allocation strategy but even slab or buddy allocators can fall back to a much slower code branch if you allocate such epic blocks of memory and en masse, and allocations may even start to require something resembling a search or more access to secondary storage in those extreme cases (here I'm afraid I'm not sure exactly what goes on behind the hood when allocating so many massive blocks, but I have "felt" and measured these kinds of bottlenecks before but in a way where I never quite figured out what the OS was doing exactly -- this above paragraph is purely conjecture).
Here it's kind of counter-intuitive but you can often get better performance allocating a larger number of smaller blocks. Typically that makes things worse, but if we're talking about 3 million floats per memory block and a thousand memory blocks like it, it might help to start allocating in, say, page-friendly 4k chunks. Typically it's cheaper to pre-allocate memory in large blocks in advance and pool it, but "large" in this case is more like 4 kilobyte blocks, not 10 gigabyte blocks.
std::deque will typically do this kind of thing for you so it might be the quickest thing to try out to see if it helps. With std::deque, you should be able to make a single one for all 10 GB worth of contents without splitting it into smaller ones to avoid bad_alloc. It also doesn't have the zero-initialization overhead of the entire contents that some cited, and push_backs to it are constant-time even in the worst-case scenario (not amortized constant time as with std::vector), so I would try std::deque with actually push_back instead of pre-sizing it and using operator[]. You could read the file contents in small chunks at a time (ex: using 4k byte buffers) and just push back the floats. It's something to try anyway.
Anyway, these are all just educated guesses without code and profiling measurements, but these are some things to try out after your measurements.
MMFs may also be the ideal solution for this scenario. Let the OS handle all the tricky details of what it takes to access the file's contents.
Use multiple threads for both memory allocation and reading files. You can create a set of say 15 threads and let each thread pick up the next available job.
When you dig deeper, you will see that opening the file also has a considerable overhead which gets reduced substantially by using multiple threads.
You don't need to handle all the data in memory. Instead of that, you should use something like virtual vector which loads required data when needed. Using that approach saves the memory and don't brings your to side effects of huge memory allocation.

Is increase in memory usage of my code possible without any memory leak?

There is a C++ code which does some calculations in iterations of a loop. When I run my code for a couple of hours, no increase in memory usage is observable. But when I let it run over night, 50 MB increase in memory usage is observed by MS Performance Monitor tool (perfmon.exe) as shown below. Plots are only for my process not the entire system.
Visual Leak Detector cannot detect any memory leak inside the implemented loop. Is it possible that cause of increase in memory usage is something other than memory leak?
The Visual Leak Detector could be fooled by memory usage patterns which look legitimate while in fact they are not. For example, if you keep allocating some things and not only forgetting to free them, but also keeping pointers to them in a list, this looks like normal memory usage as far as any dumb tool can tell.
Also, 50MB are not such a huge amount of memory for a desktop application to worry about, and in any case your observations are too limited to be able to draw any conclusions. It could be that the memory manager of your C++ runtime considers these 50MB peanuts worth sacrificing for efficiency, so it may be choosing not to bother with merging adjacent free blocks to satisfy allocation requests when there is more fresh OS memory available. In order to better theorize about what is going on you need to show us a more complete graph of the memory allocation of your application over time. Is it a continuous straight slope? Is it a slope which turned into a flat line at some point? Does it have a sudden stair step somewhere? Is it a slope which keeps rising even after you have completely ran out of physical memory and started paging?

Increase in memory footprint. False alarm or a memory leak?

I have a graphics program where I am creating and destroying the same objects over and over again. All in all, there are 140 objects. They get deleted and newed such that the number never increases 140. This is a requirement as it is a stress test, that is I cannot have a memory pool or dummy objects. Now I am fairly certain there aren't any memory leaks. I am also using a memory leak detector which is not reporting any leaks.
The problem is that the memory footprint of the program keeps increasing (albeit quite slowly, slower than the rate at which the objects are being destroyed/created). So my question then is whether an increasing memory footprint is a solid sign for memory leaks or can it sometimes be deceiving?
EDIT: I am using new/delete to create/destroy the objects
It does seem possible that this behavior could come from a situation in which there is no leak.
Is there any chance that your heap is getting fragmented?
Say you make lots of allocations of size n. You free them all, which makes your C library insert those buffers into a free list. Some other code path then makes allocations smaller than n, so those blocks in the free list get chunked up into smaller units. Then the next iteration of the loop does another batch of allocations of size n, and the free list no longer contains contiguous memory at that size, and malloc has to ask the kernel for more memory. Eventually those "smaller-than-n" allocations get freed as would your "n-sized" ones, but if you run enough iterations where the fragmentation exists, I could see the process gradually increasing its memory footprint.
One way to avoid this might be to allocate all your objects once, and not keep allocating/freeing them. Since you're using C++ this might necessitate placement new or something similar. Since you are using Windows, I might also mention that Win32 supports having multiple heaps in a process, so if your objects come from a different heap than other allocations you may avoid this.
It depends if you're under a CLR (or a virtual machine with garbage collector) or your still in the old mode (like C++, MFC ect...)
When you have a GC around - you can't really tell, only if you test it long enough. GC can decide not to clean your objects for now... (there is a way to force it)
In the native applications, yes, a footprint increase might mean a leak.
there are some tools (very good tools) for c++ that find these leaks (google devpartner or boundschecker)
I guess there are some tools for c# and Java as well.
If your application's process footprint increases beyond a reasonable limit, which depends on your application and what it does, and continues to increase until eventually you (will) run out of virtual memory, you definitely have a memory leak.
Try memory allocation tests included in CRT: http://msdn.microsoft.com/en-us/library/e5ewb1h3%28VS.80%29.aspx
They help A LOT.
But I've noticed that apps do tend to vary their memory consumption a little if you look at some factors. Windows 7 might also create extra padding in memory allocation to fix bugs: http://msdn.microsoft.com/en-us/library/dd744764%28VS.85%29.aspx
I strongly suggest to try Visual Studio 2015 (the Community Edition is free). It comes with Diagnostic Tools that helps you analyze Memory Usage; it allows you to take snapshots and view the heap

Memory management while loading huge XML files

We have an application which imports objects from an XML. The XML is around 15 GB. The application invariably starts running out of memory. We tried to free memory in between operations but this has lead to degrading performance. i.e it takes more time to complete the import operation. The CPU utilization reaches 100%
The application is written in C++.
Does the frequent call to free() will lead to performance issues?
Promoted from a comment by the OP: the parser being used in expat, which is a SAX parser with a very small footprint, and customisable memory management.
Use SAX parser instead of DOM parser.
Have you tried resuing the memory and your classes as opposed to freeing and reallocating it? Constant allocation/deallocation cycles, especially if they are coupled with small (less than 4096 bytes) data fragments can lead to serious performance problems and memory address space fragmentation.
Profile the application during one of these bothersome loads, to see where it is spending most of its time.
I believe that free() can sometimes be costly, but that is of course very much dependent on the platform's implementation.
Also, you don't say a lot about the lifelength of the objects loaded; if the XML is 15 GB, how much of that is kept around for each "object", once the markup is parsed and thrown away?
It sounds sensible to process an input document of this size in a streaming fashion, i.e. not trying a DOM-approach which loads and builds the entire XML parse tree at once.
If you want to minimise your memory usage, took a look at How to read the XML data from a file by using Visual C++.
One thing that often helps is to use a lightweight low-overhead memory pool. If you combine this with "frame" allocation methods (ignoring any delete/free until you're all done with the data), you can get something that's ridiculously fast.
We did this for an embedded system recently, mostly for performance reasons, but it saved a lot of memory as well.
The trick was basically to allocate a big block -- slightly bigger than we'd need (you could allocate a chain of blocks if you like) -- and just keep returning a "current" pointer (bumping it up by allocSize, rounded up to maximum align requirement of 4 in our case, each time). This cut our overhead per alloc from on the order of 52-60 bytes down to <= 3 bytes. We also ignored "free" calls until we were all done parsing and then freed the whole block.
If you're clever enough with your frame allocation you can save a lot of space and time. It might not get you all the way to your 15GiB, but it would be worth looking at how much space overhead you really have... My experience with DOM-based systems is that they use tons of small allocs, each with a relatively high overhead.
(If you have virtual memory, a large "block" might not even hurt that much, if your access at any given time is local to a page or three anyway...)
Obviously you have to keep the memory you actually need in the long run, but the parser's "scratch memory" becomes a lot more efficient this way.
We tried to free memory in between operations but this has lead to degrading performance .
Does the frequent call to free() will lead to performance issues ?
Based on the evidence supplied, yes.
Since your already using expat, a SAX parser, what exactly are you freeing? If you can free it, why are you mallocing it in a loop in the first place?
Maybe, it should say profiler.
Also don't forget that work with heap is single-thread. I mean that if booth of your threads will allocate/free memory in ont time, one of them will waiting when first will done.
If you allocating and free memory for same objects, you could create pool of this object and do allocate/free once.
Try and find a way to profile your code.
If you have no profiler, try and organize your work so that you only have a few free() commands (instead of the many you suggest).
It is very common to have to find the right balance between memory consumption and time efficiency.
I did not try it myself, but have you heard of XMLLite, there's an MSDN artical introducing it. It's used by MS Office internally.

How to solve Memory Fragmentation

We've occasionally been getting problems whereby our long-running server processes (running on Windows Server 2003) have thrown an exception due to a memory allocation failure. Our suspicion is these allocations are failing due to memory fragmentation.
Therefore, we've been looking at some alternative memory allocation mechanisms that may help us and I'm hoping someone can tell me the best one:
1) Use Windows Low-fragmentation Heap
2) jemalloc - as used in Firefox 3
3) Doug Lea's malloc
Our server process is developed using cross-platform C++ code, so any solution would be ideally cross-platform also (do *nix operating systems suffer from this type of memory fragmentation?).
Also, am I right in thinking that LFH is now the default memory allocation mechanism for Windows Server 2008 / Vista?... Will my current problems "go away" if our customers simply upgrade their server os?
First, I agree with the other posters who suggested a resource leak. You really want to rule that out first.
Hopefully, the heap manager you are currently using has a way to dump out the actual total free space available in the heap (across all free blocks) and also the total number of blocks that it is divided over. If the average free block size is relatively small compared to the total free space in the heap, then you do have a fragmentation problem. Alternatively, if you can dump the size of the largest free block and compare that to the total free space, that will accomplish the same thing. The largest free block would be small relative to the total free space available across all blocks if you are running into fragmentation.
To be very clear about the above, in all cases we are talking about free blocks in the heap, not the allocated blocks in the heap. In any case, if the above conditions are not met, then you do have a leak situation of some sort.
So, once you have ruled out a leak, you could consider using a better allocator. Doug Lea's malloc suggested in the question is a very good allocator for general use applications and very robust most of the time. Put another way, it has been time tested to work very well for most any application. However, no algorithm is ideal for all applications and any management algorithm approach can be broken by the right pathelogical conditions against it's design.
Why are you having a fragmentation problem? - Sources of fragmentation problems are caused by the behavior of an application and have to do with greatly different allocation lifetimes in the same memory arena. That is, some objects are allocated and freed regularly while other types of objects persist for extended periods of time all in the same heap.....think of the longer lifetime ones as poking holes into larger areas of the arena and thereby preventing the coalesce of adjacent blocks that have been freed.
To address this type of problem, the best thing you can do is logically divide the heap into sub arenas where the lifetimes are more similar. In effect, you want a transient heap and a persistent heap or heaps that group things of similar lifetimes.
Some others have suggested another approach to solve the problem which is to attempt to make the allocation sizes more similar or identical, but this is less ideal because it creates a different type of fragmentation called internal fragmentation - which is in effect the wasted space you have by allocating more memory in the block than you need.
Additionally, with a good heap allocator, like Doug Lea's, making the block sizes more similar is unnecessary because the allocator will already be doing a power of two size bucketing scheme that will make it completely unnecessary to artificially adjust the allocation sizes passed to malloc() - in effect, his heap manager does that for you automatically much more robustly than the application will be able to make adjustments.
I think you’ve mistakenly ruled out a memory leak too early.
Even a tiny memory leak can cause a severe memory fragmentation.
Assuming your application behaves like the following:
Allocate 10MB
Allocate 1 byte
Free 10MB
(oops, we didn’t free the 1 byte, but who cares about 1 tiny byte)
This seems like a very small leak, you will hardly notice it when monitoring just the total allocated memory size.
But this leak eventually will cause your application memory to look like this:
.
.
Free – 10MB
.
.
[Allocated -1 byte]
.
.
Free – 10MB
.
.
[Allocated -1 byte]
.
.
Free – 10MB
.
.
This leak will not be noticed... until you want to allocate 11MB
Assuming your minidumps had full memory info included, I recommend using DebugDiag to spot possible leaks.
In the generated memory report, examine carefully the allocation count (not size).
As you suggest, Doug Lea's malloc might work well. It's cross platform and it has been used in shipping code. At the very least, it should be easy to integrate into your code for testing.
Having worked in fixed memory environments for a number of years, this situation is certainly a problem, even in non-fixed environments. We have found that the CRT allocators tend to stink pretty bad in terms of performance (speed, efficiency of wasted space, etc). I firmly believe that if you have extensive need of a good memory allocator over a long period of time, you should write your own (or see if something like dlmalloc will work). The trick is getting something written that works with your allocation patterns, and that has more to do with memory management efficiency as almost anything else.
Give dlmalloc a try. I definitely give it a thumbs up. It's fairly tunable as well, so you might be able to get more efficiency by changing some of the compile time options.
Honestly, you shouldn't depend on things "going away" with new OS implementations. A service pack, patch, or another new OS N years later might make the problem worse. Again, for applications that demand a robust memory manager, don't use the stock versions that are available with your compiler. Find one that works for your situation. Start with dlmalloc and tune it to see if you can get the behavior that works best for your situation.
You can help reduce fragmentation by reducing the amount you allocate deallocate.
e.g. say for a web server running a server side script, it may create a string to output the page to. Instead of allocating and deallocating these strings for every page request, just maintain a pool of them, so your only allocating when you need more, but your not deallocating (meaning after a while you get the situation you not allocating anymore either, because you have enough)
You can use _CrtDumpMemoryLeaks(); to dump memory leaks to the debug window when running a debug build, however I believe this is specific to the Visual C compiler. (it's in crtdbg.h)
I'd suspect a leak before suspecting fragmentation.
For the memory-intensive data structures, you could switch over to a re-usable storage pool mechanism. You might also be able to allocate more stuff on the stack as opposed to the heap, but in practical terms that won't make a huge difference I think.
I'd fire up a tool like valgrind or do some intensive logging to look for resources not being released.
#nsaners - I'm pretty sure the problem is down to memory fragmentation. We've analyzed minidumps that point to a problem when a large (5-10mb) chunk of memory is being allocated. We've also monitored the process (on-site and in development) to check for memory leaks - none were detected (the memory footprint is generally quite low).
The problem does happen on Unix, although it's usually not as bad.
The Low-framgmentation heap helped us, but my co-workers swear by Smart Heap
(it's been used cross platform in a couple of our products for years). Unfortunately due to other circumstances we couldn't use Smart Heap this time.
We also look at block/chunking allocating and trying to have scope-savvy pools/strategies, i.e.,
long term things here, whole request thing there, short term things over there, etc.
As usual, you can usually waste memory to gain some speed.
This technique isn't useful for a general purpose allocator, but it does have it's place.
Basically, the idea is to write an allocator that returns memory from a pool where all the allocations are the same size. This pool can never become fragmented because any block is as good as another. You can reduce memory wastage by creating multiple pools with different size chunks and pick the smallest chunk size pool that's still greater than the requested amount. I've used this idea to create allocators that run in O(1).
if you talking about Win32 - you can try to squeeze something by using LARGEADDRESSAWARE. You'll have ~1Gb extra defragmented memory so your application will fragment it longer.
The simple, quick and dirty, solution is to split the application into several process, you should get fresh HEAP each time you create the process.
Your memory and speed might suffer a bit (swapping) but fast hardware and big RAM should be able to help.
This was old UNIX trick with daemons, when threads did not existed yet.