Memory management while loading huge XML files

Memory management while loading huge XML files - c++

We have an application which imports objects from an XML. The XML is around 15 GB. The application invariably starts running out of memory. We tried to free memory in between operations but this has lead to degrading performance. i.e it takes more time to complete the import operation. The CPU utilization reaches 100%
The application is written in C++.
Does the frequent call to free() will lead to performance issues?
Promoted from a comment by the OP: the parser being used in expat, which is a SAX parser with a very small footprint, and customisable memory management.

Use SAX parser instead of DOM parser.

Have you tried resuing the memory and your classes as opposed to freeing and reallocating it? Constant allocation/deallocation cycles, especially if they are coupled with small (less than 4096 bytes) data fragments can lead to serious performance problems and memory address space fragmentation.

Profile the application during one of these bothersome loads, to see where it is spending most of its time.
I believe that free() can sometimes be costly, but that is of course very much dependent on the platform's implementation.
Also, you don't say a lot about the lifelength of the objects loaded; if the XML is 15 GB, how much of that is kept around for each "object", once the markup is parsed and thrown away?
It sounds sensible to process an input document of this size in a streaming fashion, i.e. not trying a DOM-approach which loads and builds the entire XML parse tree at once.

If you want to minimise your memory usage, took a look at How to read the XML data from a file by using Visual C++.

One thing that often helps is to use a lightweight low-overhead memory pool. If you combine this with "frame" allocation methods (ignoring any delete/free until you're all done with the data), you can get something that's ridiculously fast.
We did this for an embedded system recently, mostly for performance reasons, but it saved a lot of memory as well.
The trick was basically to allocate a big block -- slightly bigger than we'd need (you could allocate a chain of blocks if you like) -- and just keep returning a "current" pointer (bumping it up by allocSize, rounded up to maximum align requirement of 4 in our case, each time). This cut our overhead per alloc from on the order of 52-60 bytes down to <= 3 bytes. We also ignored "free" calls until we were all done parsing and then freed the whole block.
If you're clever enough with your frame allocation you can save a lot of space and time. It might not get you all the way to your 15GiB, but it would be worth looking at how much space overhead you really have... My experience with DOM-based systems is that they use tons of small allocs, each with a relatively high overhead.
(If you have virtual memory, a large "block" might not even hurt that much, if your access at any given time is local to a page or three anyway...)
Obviously you have to keep the memory you actually need in the long run, but the parser's "scratch memory" becomes a lot more efficient this way.

We tried to free memory in between operations but this has lead to degrading performance .
Does the frequent call to free() will lead to performance issues ?
Based on the evidence supplied, yes.
Since your already using expat, a SAX parser, what exactly are you freeing? If you can free it, why are you mallocing it in a loop in the first place?

Maybe, it should say profiler.
Also don't forget that work with heap is single-thread. I mean that if booth of your threads will allocate/free memory in ont time, one of them will waiting when first will done.
If you allocating and free memory for same objects, you could create pool of this object and do allocate/free once.

Try and find a way to profile your code.
If you have no profiler, try and organize your work so that you only have a few free() commands (instead of the many you suggest).
It is very common to have to find the right balance between memory consumption and time efficiency.

I did not try it myself, but have you heard of XMLLite, there's an MSDN artical introducing it. It's used by MS Office internally.

Related

Which memory allocation algorithm suits best for performance and time critical c++ applications?

I ask this question to determine which memory allocation algorithm gives better results with performance critical applications, like game engines, or embedded applications. Results are actually depends percentage of memory fragmented and time-determinism of memory request.
There are several algorithms in the text books (e.g. Buddy memory allocation), but also there are others like TLSF. Therefore, regarding memory allocation algorithms available, which one of them is fastest and cause less fragmentation. BTW, Garbage collectors should be not included.
Please also, note that this question is not about profiling, it just aims to find out optimum algorithm for given requirements.

It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.
If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?
Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.
Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.
As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?
As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."
I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.

Barış:
Your question is very general, but here's my answer/guidance:
I don't know about game engines, but for embedded and real time applications, The general goals of an allocation algorithm are:
1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.
2- Fast execution: Well, the faster the better, obviously
3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!
4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.
In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements and determine your maximum memory use and allocate a large chunk of memory as soon as your application starts. If you can't, then the application does not even start, if it does start, no new memory blocks are allocated during execution.
If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory in the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another -probably large- allocation (in anticipation of more memory requests), and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is another whole topic.
As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:
Are you going to make lots of very small allocations, or relatively larger ones?
Will the allocations come in bursts, or spread equally throughout the application?
What is the lifetime of the allocations?
If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.

As other already wrote, there is no "optimum algorithm" for each possible application. It was already proven that for any possible algorithm you can find an allocation sequence which will cause a fragmentation.
Below I write a few hints from my game development experience:
Avoid allocations if you can
A common practices in the game development field was (and to certain extent still is) to solve the dynamic memory allocation performance issues by avoiding the memory allocations like a plague. It is quite often possible to use stack based memory instead - even for dynamic arrays you can often come with an estimate which will cover 99 % of cases for you and you need to allocate only when you are over this boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" you allocate up front and perform the individual allocations from this heap only.
Memory allocator libraries
Another option is to use some of the memory allocation libraries - they are usually created by experts in the field to fit some special requirements, and if you have similar requiremens, they may fit your requirements.
Multithreading
There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, by aware both OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to perform significant threading, you need either to reduce the allocation as much as possible, or to use some of the alternatives. See Can multithreading speed up memory allocation?

The best practice is - use whatever you can use to make the thing done in time (in your case - default allocator). If the whole thing is very complex - write tests and samples that will emulate parts of the whole thing. Then, run performance tests and benchmarks to find bottle necks (probably they will nothing to do with memory allocation :).
From this point you will see what exactly slowdowns your code and why. Only based on such precise knowledge you can ever optimize something and choose one algorithm over another. Without tests its just a waste of time since you can't even measure how much your optimization will speedup your app (in fact such "premature" optimizations can really slowdown it).
Memory allocation is a very complex thing and it really depends on many factors. For example, such allocator is simple and damn fast but can be used only in limited number of situations:
char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];
char *poolHead = pool;
void *alloc(size_t sz) { char *p = poolHead; poolHead += sz; return p; }
void free() { poolHead = pool; }
So there is no "the best algorithm ever".

One constraint that's worth mentioning, which has not been mentioned yet, is multi-threading: Standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that it gets deallocated by a different thread.
As you may have guessed from that description, it is a tricky task to implement an allocator that handles all of this well. And it does cost performance as it is impossible to satisfy all these constrains without inter-thread communication (= use of atomic variables and locks) which is quite costly.
As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator that's based on a number of fixed sized memory pools for small objects, stacking free objects with an intrusive linked list.
Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.

How does a memory leak improve performance

I'm building a large RTree (spatial index) full of nodes. It needs to be able to handle many queries AND updates. Objects are continuously being created and destroyed. The basic test I'm running is to see the performance of the tree as the number of objects in the tree increases. I insert from 100-20000 uniformly size, randomly located objects in increments of 100. Searching and updating are irrelevant to the issue I am currently faced with.
Now, when there is NO memory leak the "insert into tree" performance is everywhere. It goes anywhere from 10.5 seconds with ~15000 objects to 1.5 with ~18000. There is no pattern whatsoever.
When I deliberately add in a leak, as simple as putting in "new int;" I don't assign it to anything, that right there is a line to itself, the performance instantly falls onto a nice gentle curve sloping from 0 (roughly) seconds for 100 objects to 1.5 for the full 20k.
Very, very lost at this point. If you want source code I can include it but it's huuugggeeee and literally the only line that makes a difference is "new int;"
Thanks in advance!
-nick

I'm not sure how you came up with this new int test, but it's not very good way to fix things :) Run your code using a profiler and find out where the real delays are. Then concentrate on fixing the hot spots.
g++ has it built in - just compile with -pg

Without more information it's impossible to be sure.
However I wonder if this is to do with heap fragmentation. By creating a freeing many blocks of memory you'll likely be creating a whole load of small fragments of memory linked together.The memory manager needs to keep track of them all so it can allocate them again if needed.
Some memory managers when you free a block try to "merge" it with surrounding blocks of memory and on a highly fragmented heap this can be very slow as it tries to find the surrounding blocks. Not only this, but if you have limited physical memory it can "touch " many physical pages of memory as it follows the chain of memory blocks which can cause a whole load of extremely slow page faults which will be very variable in speed depending on exactly how much physical memory the OS decides to give that process.
By leaving some un-freed memory you will be changing this pattern of access which might make a large difference to the speed. You might for example be forcing the run time library to allocate new block of memory each time rather than having to track down a suitably sized existing block to reuse.
I have no evidence this is the case in your program, but I do know that memory fragmentation is often the causes of slow programs when a lot of new and free is performed.

The possible thing that is happening which explains this (a theory)
The compiler did not remove the empty new int
The new int is in one of the inner loops or somewhere in your recursive traversal wherein it gets executed the most amount of time
The overall RSS of the process increases and eventually the total memory being used by the process
There are page faults happening because of this
Because of the page-faults, the process becomes I/O bound instead of being CPU bound
End result, you see a drop in the throughput. It will help if you can mention the compiler being used and the options for the compiler that you are using to build the code.

I am taking a stab in the dark here but the problem could be the way the heap gets fragmented. You said that you are creating a destroying large numbers of objects. I will assume that the objects are all of different size.
When one allocates memory on the heap, a cell the size needed is broken off from the heap. When the memory is freed, the cell is added to a freelist. When one does a new alloc, the allocator walks the heap until a cell that is big enough is found. When doing large numbers of allocations, the free list can get rather long and walking the list can take a non-trivial amount of time.
Now an int is rather small. So when you do your new int, it may well eat up all the small heap cells on the free list and thus dramatically speed up larger allocations.
The chances are, however that you are allocating and freeing similar sized objects. If you use your own freelists, you will safe yourself many heap walks and may dramatically improve performance. This is exactly what the STL allocators do to improve performance.

Solution: Do not run from Visual Studio. Actually run the .exe file. Figured this out because that's what the profilers were doing and the numbers were magically dropping. Checked memory usage and version running (and giving me EXCEPTIONAL times) was not blowing up to excessively huge sizes.
Solution to why the hell Visual Studio does ridiculous crap like this: No clue.

Will Garbage Collected C be Faster Than C++?

I had been wondering for quite some time on how to manager memory in my next project. Which is writing a DSL in C/C++.
It can be done in any of the three ways.
Reference counted C or C++.
Garbage collected C.
In C++, copying class and structures from stack to stack and managing strings separately with some kind of GC.
The community probably already has a lot of experience on each of these methods. Which one will be faster? What are the pros and cons for each?
A related side question. Will malloc/free be slower than allocating a big chunk at the beginning of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on OS to do this job better and faster than what we can do ourselves.

It all depends! That's a pretty open question. It needs an essay to answer it!
Hey.. here's one somebody prepared earlier:
http://lambda-the-ultimate.org/node/2552
http://www.hpl.hp.com/personal/Hans_Boehm/gc/issues.html
It depends how big your objects are, how many of them there are, how fast they're being allocated and discarded, how much time you want to invest optimizing and tweaking to make optimizations. If you know the limits of how much memory you need, for fast performance, I would think you can't really beat grabbing all the memory you need from the OS up front, and then managing it yourself.
The reason it can be slow allocating memory from the OS is that it deals with lots of processes and memory on disk and in ram, so to get memory it's got to decide if there is enough. Possibly, it might have to page another processes memory out from ram to disk so it can give you enough. There's lots going on. So managing it yourself (or with a GC collected heap) can be far quicker than going to the OS for each request. Also, the OS usually deals with bigger chunks of memory, so it might round up the size of requests you make meaning you could waste memory.
Have you got a real hard requirement for going super quick? A lot of DSL applications don't need raw performance. I'd suggest going with whatever's simplest to code. You could spend a lifetime writing memory management systems and worrying which is best.

Why would garbage collected C be faster than C++? The only garbage collectors available for C are pretty inefficient things, more designed to plug memory leaks than to actually improve the quality of your code.
In any case, C++ has the potential for reaching better performance with less code (note that it's only a potential. It's also very possible to write C++ code that is far slower than the equivalent C).
Considering the current state of both languages, GC's are not currently going to improve performance in your code. GC's can be made very efficient in languages designed for it. C/C++ are not among those. ;)
Apart from that, it's impossible to say. Languages don't have a speed. It doesn't make sense to ask which language is faster. It depends on 1) the specific code, 2) the compiler that compiles it, and 3) the system it's running on (hardware as well as OS).
malloc is a fairly slow operation, far slower than the .NET equivalents, so yes, if you are performing a lot of small allocations, you may be better off allocating a large pool of memory once, and then using chunks of that.
The reason is that the OS has to find a free chunk of memory, basically by following a linked list of all free memory areas. In .NET, a new() call is basically nothing more than moving the heap pointer as many bytes as required by the allocation.

uh ... It depends how you write the garbage collection system for your DSL. Neither C or C++ comes with a garbage collection facility built-in but either could be used to write a very efficient or a very inefficient garbage collector. Writing such a thing, by the way, is a non-trivial task.
DSLs are often written in higher level languages such as Ruby or Python specifically because the language writer can leverage the garbage collection and other facilities of the language. C and C++ are great for writing full, industrial strength languages but you certainly need to know what you are doing to use them - knowledge of yacc and lex is especially useful here but a good understanding of dynamic memory management is important also, as you say. You could also check out keykit, an open source music DSL written in C, if you still like the idea of a DSL in C/C++.

With most garbage collection implementations, allocation can see a speed improvement, but then you have the additional cost of the collection phase which can be triggered at any point in your program's execution, leading to a sudden (seemingly random) delay.
As for your second question, it depends on your memory management algorithms. You'd be safe sticking with your library's default malloc implementation, but there are alternatives which boast better performance.

A related side question. Will malloc/free be slower than allocating a big chuck at the begining of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on OS to do this job better and faster than what we can do ourselves.
The problem with letting the OS handle memory allocation is that it introduces indeterministic behaviour. There's no way for the programmer to know how long the OS will take to return a new chunk of memory - an allocation may be quite costly if memory has to be paged out to disk.
Preallocating therefore might be a good idea, especially when using a copying garbage collector. It'll increase memory consumption, but allocation will be fast because in most cases it'll just be a pointer increment.

As people have pointed out - GC is faster to allocate (because it just gives you the next block on its list), but slower overall (because it has to compact the heap regularly, in order for allocs to be fast).
so - go for the compromise solution (which is actually pretty damn good):
You create your own heaps, one for each size of object you generally allocate (or 4-byte, 8 byte, 16-byte, 32-byte, etc) then, when you want a new piece of memory you grab the last 'block' on the appropriate heap. Because you pre-allocate from these heaps, all you need to do when allocating is grab the next free block. This works better than the standard allocator because you are happily wasting memory - if you want to allocate 12 bytes, you'll give up a whole 16 byte block from the 16-byte heap. You keep a bitmap of used v free blocks so you can allocate quickly without wasting loads of memory or needing to compact.
Also, because you're running several heaps, highly-parallel systems work much better as you don't need to lock so often (ie you have multiple locks for each heap so you don't get contention nearly as much)
Try it - we used it to replace the standard heap on a very intensive application, performance went up by quite a lot.
BTW. the reason the standard allocators are slow is that they try not to waste memory - so if you allocate a 5 byte, 7 byte and 32 bytes from the standard heap, it'll keep those 'boundaries'. Next time you need to allocate, it'll walk through those looking for enough space to give you what you asked for. That worked well for low-memory systems, but you only have to look at how much memory most apps use today to see that GC systems go the other way, and try to make allocations as fast as possible whilst caring nothing for how much memory is wasted.

The problem has a lot of variables, but if your application is written with garbage collection in mind, and if you exploit the special features of the Boehm collector, such as different allocation calls for blocks that don't contain pointers, then as a general rule your application
- Will have simpler interfaces
- Will run somewhat faster
- Will require from 1.2x to 2x the space
than a similar application using explicit memory management.
For documentation and evidence supporting these claims, you can see the information on Boehm's web site, and also Ben Zorn's several papers on the measured cost of conservative garbage collection.
Most importantly you'll save a ton of effort and won't have to worry about a significant class of memory-management bugs.
The issue of C vs C++ is orthogonal, but GC will definitely be faster than reference counting, especially when there's no compiler support for reference counting.

Neither C nor C++ will give you garbage for free. What they will give you is memory allocation libraries (which provide malloc/free, etc). There are many online resources to algorithms for writing garbage collection libraries. A good start is link text

Most non GC languages will allocate and de-allocate the memory as needed and no longer needed. GC'd languages usually allocate large chunks of memory before hand and only free the memory when idle and not in the middle of a intensive task so I am going to yes if GC kicks in at correct time.
The D programming language is a garbage collected language and ABI compatible with C and partly ABI compatible with C++. This Page shows some benchmarks between string performance in C++ and D.

I suggest that if you have written a program where memory allocation and deallocation (explicitly or GC'ed) is the bottleneck, then you should re-think your architecture, design and implementation.

If you don't want to explicitly manage memory, don't use C/C++. There are plenty of languages with either reference counting or compiler-supported garbage collectors that will probably work much better for you.
C/C++ are designed in an environment where the programmer manages their own memory. Trying to retrofit GC or ref counting onto them may help some, but you'll find that you either have to compromise the performance of the GC (because it doesn't have any compiler hinting as to where pointers might be), or you'll find new and fascinating ways that you can screw up the reference counts or the GC or whatever.
I know it sounds like a good idea, but really, you should just grab a language more suited to the task.

Reducing memory footprint of large unfamiliar codebase

Suppose you have a fairly large (~2.2 MLOC), fairly old (started more than 10 years ago) Windows desktop application in C/C++. About 10% of modules are external and don't have sources, only debug symbols.
How would you go about reducing application's memory footprint in half? At least, what would you do to find out where memory is consumed?

Override malloc()/free() and new()/delete() with wrappers that keep track of how big the allocations are and (by recording the callstack and later resolving it against the symbol table) where they are made from. On shutdown, have your wrapper display any memory still allocated.
This should enable you both to work out where the largest allocations are and to catch any leaks.

this is description/skeleton of memory tracing application I used to reduce memory consumption of our game by 20%. It helped me to track many allocations done by external modules.

It's not an easy task. Begin by chasing down any memory leaks you cand find (a good tool would be Rational Purify). Skim the source code and try to optimize data structures and/or algorithms.
Sorry if this may sound pessimistic, but cutting down memory usage by 50% doesn't sound realistic.

There is a chance is you can find some significant inefficiencies very fast. First you should check what is the memory used for. A tool which I have found very handy for this is Memory Validator
Once you have this "memory usage map", you can check for Low Hanging Fruit. Are there any data structures consuming a lot of memory which could be represented in a more compact form? This is often possible, esp. when the data access is well encapsulated and when you have a spare CPU power you can dedicate to compressing / decompressing them on each access.

I don't think your question is well posed.
The size of source code is not directly related to the memory footprint. Sure, the compiled code will occupy some memory but the application might will have memory requirements on it's own. Both static (the variables declared in the code) and dynamic (the object the application creates).
I would suggest you to profile program execution and study the code carefully.

First places to start for me would be:
Does the application do a lot of preallocation memory to be used later? Does this memory often sit around unused, never handed out? Consider switching to newing/deleting (or better use a smart_ptr) as needed.
Does the code use a static array such as
Object arrayOfObjs[MAX_THAT_WILL_EVER_BE_USED];
and hand out objs in this array? If so, consider manually managing this memory.

One of the tools for memory usage analysis is LeakDiag, available for free download from Microsoft. It apparently allows to hook all user-mode allocators down to VirtualAlloc and to dump process allocation snapshots to XML at any time. These snapshots then can be used to determine which call stacks allocate most memory and which call stacks are leaking. It lacks pretty frontend for snapshot analysis (unless you can get LDParser/LDGrapher via Microsoft Premier Support), but all the data is there.
One more thing to note is that you may have false leak positives from BSTR allocator due to caching, see "Hey, why am I leaking all my BSTR's?"

How to solve Memory Fragmentation

We've occasionally been getting problems whereby our long-running server processes (running on Windows Server 2003) have thrown an exception due to a memory allocation failure. Our suspicion is these allocations are failing due to memory fragmentation.
Therefore, we've been looking at some alternative memory allocation mechanisms that may help us and I'm hoping someone can tell me the best one:
1) Use Windows Low-fragmentation Heap
2) jemalloc - as used in Firefox 3
3) Doug Lea's malloc
Our server process is developed using cross-platform C++ code, so any solution would be ideally cross-platform also (do *nix operating systems suffer from this type of memory fragmentation?).
Also, am I right in thinking that LFH is now the default memory allocation mechanism for Windows Server 2008 / Vista?... Will my current problems "go away" if our customers simply upgrade their server os?

First, I agree with the other posters who suggested a resource leak. You really want to rule that out first.
Hopefully, the heap manager you are currently using has a way to dump out the actual total free space available in the heap (across all free blocks) and also the total number of blocks that it is divided over. If the average free block size is relatively small compared to the total free space in the heap, then you do have a fragmentation problem. Alternatively, if you can dump the size of the largest free block and compare that to the total free space, that will accomplish the same thing. The largest free block would be small relative to the total free space available across all blocks if you are running into fragmentation.
To be very clear about the above, in all cases we are talking about free blocks in the heap, not the allocated blocks in the heap. In any case, if the above conditions are not met, then you do have a leak situation of some sort.
So, once you have ruled out a leak, you could consider using a better allocator. Doug Lea's malloc suggested in the question is a very good allocator for general use applications and very robust most of the time. Put another way, it has been time tested to work very well for most any application. However, no algorithm is ideal for all applications and any management algorithm approach can be broken by the right pathelogical conditions against it's design.
Why are you having a fragmentation problem? - Sources of fragmentation problems are caused by the behavior of an application and have to do with greatly different allocation lifetimes in the same memory arena. That is, some objects are allocated and freed regularly while other types of objects persist for extended periods of time all in the same heap.....think of the longer lifetime ones as poking holes into larger areas of the arena and thereby preventing the coalesce of adjacent blocks that have been freed.
To address this type of problem, the best thing you can do is logically divide the heap into sub arenas where the lifetimes are more similar. In effect, you want a transient heap and a persistent heap or heaps that group things of similar lifetimes.
Some others have suggested another approach to solve the problem which is to attempt to make the allocation sizes more similar or identical, but this is less ideal because it creates a different type of fragmentation called internal fragmentation - which is in effect the wasted space you have by allocating more memory in the block than you need.
Additionally, with a good heap allocator, like Doug Lea's, making the block sizes more similar is unnecessary because the allocator will already be doing a power of two size bucketing scheme that will make it completely unnecessary to artificially adjust the allocation sizes passed to malloc() - in effect, his heap manager does that for you automatically much more robustly than the application will be able to make adjustments.

I think you’ve mistakenly ruled out a memory leak too early.
Even a tiny memory leak can cause a severe memory fragmentation.
Assuming your application behaves like the following:
Allocate 10MB
Allocate 1 byte
Free 10MB
(oops, we didn’t free the 1 byte, but who cares about 1 tiny byte)
This seems like a very small leak, you will hardly notice it when monitoring just the total allocated memory size.
But this leak eventually will cause your application memory to look like this:
.
.
Free – 10MB
.
.
[Allocated -1 byte]
.
.
Free – 10MB
.
.
[Allocated -1 byte]
.
.
Free – 10MB
.
.
This leak will not be noticed... until you want to allocate 11MB
Assuming your minidumps had full memory info included, I recommend using DebugDiag to spot possible leaks.
In the generated memory report, examine carefully the allocation count (not size).

As you suggest, Doug Lea's malloc might work well. It's cross platform and it has been used in shipping code. At the very least, it should be easy to integrate into your code for testing.
Having worked in fixed memory environments for a number of years, this situation is certainly a problem, even in non-fixed environments. We have found that the CRT allocators tend to stink pretty bad in terms of performance (speed, efficiency of wasted space, etc). I firmly believe that if you have extensive need of a good memory allocator over a long period of time, you should write your own (or see if something like dlmalloc will work). The trick is getting something written that works with your allocation patterns, and that has more to do with memory management efficiency as almost anything else.
Give dlmalloc a try. I definitely give it a thumbs up. It's fairly tunable as well, so you might be able to get more efficiency by changing some of the compile time options.
Honestly, you shouldn't depend on things "going away" with new OS implementations. A service pack, patch, or another new OS N years later might make the problem worse. Again, for applications that demand a robust memory manager, don't use the stock versions that are available with your compiler. Find one that works for your situation. Start with dlmalloc and tune it to see if you can get the behavior that works best for your situation.

You can help reduce fragmentation by reducing the amount you allocate deallocate.
e.g. say for a web server running a server side script, it may create a string to output the page to. Instead of allocating and deallocating these strings for every page request, just maintain a pool of them, so your only allocating when you need more, but your not deallocating (meaning after a while you get the situation you not allocating anymore either, because you have enough)
You can use _CrtDumpMemoryLeaks(); to dump memory leaks to the debug window when running a debug build, however I believe this is specific to the Visual C compiler. (it's in crtdbg.h)

I'd suspect a leak before suspecting fragmentation.
For the memory-intensive data structures, you could switch over to a re-usable storage pool mechanism. You might also be able to allocate more stuff on the stack as opposed to the heap, but in practical terms that won't make a huge difference I think.
I'd fire up a tool like valgrind or do some intensive logging to look for resources not being released.

#nsaners - I'm pretty sure the problem is down to memory fragmentation. We've analyzed minidumps that point to a problem when a large (5-10mb) chunk of memory is being allocated. We've also monitored the process (on-site and in development) to check for memory leaks - none were detected (the memory footprint is generally quite low).

The problem does happen on Unix, although it's usually not as bad.
The Low-framgmentation heap helped us, but my co-workers swear by Smart Heap
(it's been used cross platform in a couple of our products for years). Unfortunately due to other circumstances we couldn't use Smart Heap this time.
We also look at block/chunking allocating and trying to have scope-savvy pools/strategies, i.e.,
long term things here, whole request thing there, short term things over there, etc.

As usual, you can usually waste memory to gain some speed.
This technique isn't useful for a general purpose allocator, but it does have it's place.
Basically, the idea is to write an allocator that returns memory from a pool where all the allocations are the same size. This pool can never become fragmented because any block is as good as another. You can reduce memory wastage by creating multiple pools with different size chunks and pick the smallest chunk size pool that's still greater than the requested amount. I've used this idea to create allocators that run in O(1).

if you talking about Win32 - you can try to squeeze something by using LARGEADDRESSAWARE. You'll have ~1Gb extra defragmented memory so your application will fragment it longer.

The simple, quick and dirty, solution is to split the application into several process, you should get fresh HEAP each time you create the process.
Your memory and speed might suffer a bit (swapping) but fast hardware and big RAM should be able to help.
This was old UNIX trick with daemons, when threads did not existed yet.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js