Reducing the memory footprint of a large unfamiliar codebase - C++

Suppose you have a fairly large (~2.2 MLOC), fairly old (started more than 10 years ago) Windows desktop application in C/C++. About 10% of modules are external and don't have sources, only debug symbols.
How would you go about cutting the application's memory footprint in half? At the very least, what would you do to find out where memory is consumed?

Override malloc()/free() and operator new/operator delete with wrappers that keep track of how big the allocations are and (by recording the call stack and later resolving it against the symbol table) where they are made from. On shutdown, have your wrapper display any memory still allocated.
This should enable you both to work out where the largest allocations are and to catch any leaks.
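Here is a minimal sketch of that wrapper idea, assuming Windows (CaptureStackBackTrace is the Win32 call for grabbing a call stack). It keeps only a stack hash and a running byte count; a real tracker would store the raw frames for later resolution against the symbol table, cover the new[]/nothrow variants, and add locking:

    #include <windows.h>
    #include <cstdio>
    #include <cstdlib>
    #include <new>

    struct AllocHeader {
        std::size_t size;      // user-requested size
        ULONG stackHash;       // hash of the allocating call stack
    };

    static std::size_t g_totalBytes = 0;   // outstanding bytes (not thread-safe)

    void* operator new(std::size_t size) {
        void* frames[32];
        ULONG hash = 0;
        // Record where the allocation came from; keep the raw frames too
        // if you want to resolve them against debug symbols later.
        CaptureStackBackTrace(1, 32, frames, &hash);

        AllocHeader* h = static_cast<AllocHeader*>(
            std::malloc(sizeof(AllocHeader) + size));
        if (!h) throw std::bad_alloc();
        h->size = size;
        h->stackHash = hash;
        g_totalBytes += size;
        return h + 1;                      // caller gets the bytes after the header
    }

    void operator delete(void* p) noexcept {
        if (!p) return;
        AllocHeader* h = static_cast<AllocHeader*>(p) - 1;
        g_totalBytes -= h->size;
        std::free(h);
    }

    // Call at shutdown to report whatever is still outstanding.
    void ReportOutstanding() {
        std::printf("still allocated: %zu bytes\n", g_totalBytes);
    }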

This is a description/skeleton of the memory-tracing application I used to reduce our game's memory consumption by 20%. It helped me track down many allocations done by external modules.

It's not an easy task. Begin by chasing down any memory leaks you can find (a good tool would be Rational Purify). Skim the source code and try to optimize data structures and/or algorithms.
Sorry if this may sound pessimistic, but cutting down memory usage by 50% doesn't sound realistic.

There is a chance you can find some significant inefficiencies very quickly. First, you should check what the memory is used for. A tool I have found very handy for this is Memory Validator.
Once you have this "memory usage map", you can check for low-hanging fruit. Are there any data structures consuming a lot of memory which could be represented in a more compact form? This is often possible, especially when data access is well encapsulated and when you have spare CPU power you can dedicate to compressing/decompressing them on each access.
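As a hypothetical illustration of such a compaction (the record type and field widths are invented), bit fields plus fixed-point storage can halve a struct that is held millions of times:

    struct RecordFat {            // 16 bytes on a typical 64-bit compiler
        int    id;                // actual values fit in 24 bits
        bool   visible;
        bool   selected;
        double weight;            // only ever holds 0.00 .. 655.35
    };

    struct RecordCompact {        // 8 bytes - half the footprint
        unsigned id       : 24;
        unsigned visible  : 1;
        unsigned selected : 1;
        unsigned weight   : 16;   // fixed-point, stored as weight * 100
    };

    // "Decompression" on access costs a little CPU:
    inline double GetWeight(const RecordCompact& r) { return r.weight / 100.0; }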

I don't think your question is well posed.
The size of the source code is not directly related to the memory footprint. Sure, the compiled code will occupy some memory, but the application will have memory requirements of its own, both static (the variables declared in the code) and dynamic (the objects the application creates).
I would suggest profiling program execution and studying the code carefully.

First places to start for me would be:
Does the application preallocate a lot of memory to be used later? Does this memory often sit around unused, never handed out? If so, consider switching to allocating with new/delete (or better, a smart pointer) as needed.
Does the code use a static array such as
Object arrayOfObjs[MAX_THAT_WILL_EVER_BE_USED];
and hand out objs in this array? If so, consider manually managing this memory.
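A sketch of the on-demand alternative, assuming Object is the element type from the array above:

    #include <memory>
    #include <vector>

    struct Object { /* whatever the array held */ };

    std::vector<std::unique_ptr<Object>> objs;   // grows with actual use

    void AddObject() {
        objs.push_back(std::unique_ptr<Object>(new Object));
    }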

One of the tools for memory usage analysis is LeakDiag, available for free download from Microsoft. It apparently allows you to hook all user-mode allocators down to VirtualAlloc and to dump process allocation snapshots to XML at any time. These snapshots can then be used to determine which call stacks allocate the most memory and which call stacks are leaking. It lacks a pretty frontend for snapshot analysis (unless you can get LDParser/LDGrapher via Microsoft Premier Support), but all the data is there.
One more thing to note is that you may get false leak reports from the BSTR allocator due to caching; see "Hey, why am I leaking all my BSTR's?"

Related

Implementing a memory manager in multithreaded C/C++ with dynamically sized memory pool?

Background: I'm developing a multiplatform framework of sorts that will be used as a base for both game and util/tool creation. The basic idea is to have a pool of workers, each executing in its own thread. (Furthermore, workers will also be able to spawn at runtime.) Each thread will have its own memory manager.
I have long thought about creating my own memory management system, and I think this project will be perfect to finally give it a try. I find such a system fitting because the intended uses of this framework (games and texture-editing tools) will often require real-time memory allocation.
Problems:
No generally applicable solution(?) - The framework will be used for both games/visualization (not AAA, but indie/play) and tool/application creation. My understanding is that for game development it is usual (at least for console games) to allocate a big chunk of memory only once in the initialization, and then use this memory internally in the memory manager. But is this technique applicable in a more general application?
In a game you could theoretically know how much memory your scenes and resources will need, but for example, a photo editing application will load resources of all different sizes... So in the latter case a more dynamic memory "chunk size" would be needed? Which leads me to the next problem:
Moving already allocated data and keeping valid pointers - Normally when allocating on the heap, you will acquire a simple pointer to the memory chunk. In a custom memory manager, as far as I understand it, a similar approach is to return a pointer to somewhere free in the pre-allocated chunk. But what happens if the pre-allocated chunk is too small and needs to be resized or even defragmented? The data would need to be moved around in memory and the old pointers would become invalid. Is there a way to transparently wrap these pointers in some way, but still use them outside the memory management as if they were ordinary C++ pointers?
Third-party libraries - If there is no way to transparently use a custom memory management system for all memory allocation in the application, every third-party library I link with will still use the "old" OS memory allocations internally. I have learned that it is common for libraries to expose functions to set custom allocation functions that the library will use, but it is not guaranteed that every library I use will have this ability.
Questions: Is it possible and feasible to implement a memory manager that can use a dynamically sized memory chunk pool? If so, how would defragmentation and memory resize work, without breaking currently in-use pointers? And finally, how is such a system best implemented to work with third party libraries?
I'm also thankful for any related reading material, papers, articles and whatnot! :-)
As someone who has previously written many memory managers and heap implementations for AAA games for the last few generations of consoles, let me tell you it's simply not worth it anymore.
Your information is old - back in the GameCube era [circa 2003] we used to do what you describe: allocate a large chunk and carve it up manually using custom algorithms tweaked for each game.
Once virtual memory came along (the Xbox era), games got more complicated [and so made more allocations and became multithreaded], and address-space fragmentation made this untenable. So we switched to custom allocators that handle only certain types of requests - for instance physical memory, or lock-free small-block low-fragmentation heaps, or thread-local caches of recently used blocks.
As built-in memory managers get better, it becomes harder to beat them - certainly in the general case, and it's a close thing even for specific use cases. The Doug Lea allocator [or whatever the mainstream C++ Linux toolchains ship now] and the latest Windows low-fragmentation heaps are really very good, and you'd do far better investing your time elsewhere.
I've got spreadsheets at work measuring all kinds of metrics for a whole load of allocators - all the big-name ones and a fair few I've collected over the years. And basically, whilst the specialist allocators can win on a few metrics [lowest overhead per alloc, spatial proximity, lowest fragmentation, etc.], for overall metrics the mainstream ones are simply the best.
As a user of your library, my personal preferred option is you just allocate memory when you need it. Use operator new/the new operator and I can use the standard C++ mechanisms to replace those and use my custom heap (if I indeed have one), or alternatively I can use platform specific ways of replacing your allocations (e.g. XMemAlloc on Xbox). I don't need tagging [capturing callstacks is far superior which I can do if I want]. Lower down that list comes you giving me an interface that you'll call when you need to allocate memory - this is just a pain for you to implement and I'll probably just pass it onto operator new anyway. The worst thing you can do is 'know best' and create your own custom heaps. If memory allocation performance is a problem, I'd much rather you share the solution the whole game uses than roll your own.
If you're looking to write your own malloc()/free(), etc., you should probably start by checking out the source code of existing systems such as dlmalloc. Be warned, though: writing your own malloc library is Hard. Beating existing general-purpose malloc libraries will be Even Harder.
And now, here is the correct answer: DON'T IMPLEMENT YET ANOTHER MEMORY MANAGER.
It is incredibly hard to implement a memory manager that does not fail under different kinds of usage patterns and events. You may be able to build a specific manager that works well under YOUR usage patterns, but writing one which works well for MANY users is a full-time job that almost no one has really done well. Worse, it is fantastically easy to implement a memory manager that works great 99% of the time and then 1% of the time crashes or suddenly consumes most or all available memory on your system due to unexpected heap fragmentation.
I say this as someone who has written multiple memory managers, watched multiple people write their own memory managers, and watched even more people attempt to write memory managers and fail. This problem is deceptively difficult, not because it's hard to write templated allocators and generic types with inheritance and such, but because the other solutions given in this thread tend to fail under corner-case load behavior. Once you start supporting byte alignments (as all real-world allocators must), heap fragmentation rears its ugly head. Cute heuristics that work great for small test programs fail miserably when subjected to large, real-world programs.
And once you get it working, someone else will need: cookies to verify against memory stomps; heap usage reporting; memory pools; pools of pools; memory leak tracking and reporting; heap auditing; chunk splitting and coalescing; thread-local storage; lookasides; CPU and process-level page faulting and protection; setting and checking and clearing "free-memory" patterns aka 0xdeadbeef; and whatever else I can't think of off the top of my head.
Writing yet another memory manager falls squarely under the heading of Premature Optimization. Since there are multiple free, good, memory managers with thousands of hours of development and testing behind them, you have to justify spending the cost of your own time in such a way that the result would provide some sort of measurable improvement over what other people have done, and you can use, for free.
If you are SURE you want to implement your own memory manager (and hopefully you are NOT sure after reading this message), read through the dlmalloc sources in detail, then read through the tcmalloc sources in detail as well, THEN make sure you understand the performance trade-offs in implementing a thread-safe versus a thread-unsafe memory manager, and why the naive implementations tend to give poor performance results.
Prepare more than one solution and let the user of the framework adopt whichever one fits. Policy classes for the generic allocator you develop would do this nicely.
A nice way to get around this is to wrap pointers in a class with an overloaded * operator. Make the internal data of that class only an index into the memory pool. Now, you can just change the index quickly after a background thread copies the data over.
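A minimal sketch of that handle idea (the pool layout and names are invented for illustration) - the "pointer" stores only an index, so the pool is free to move its storage:

    #include <cstddef>
    #include <vector>

    template <typename T>
    class Pool {
    public:
        std::size_t alloc(const T& v) { items_.push_back(v); return items_.size() - 1; }
        T& at(std::size_t i) { return items_[i]; }   // valid even after storage moves
    private:
        std::vector<T> items_;    // may be moved/compacted behind the handles' backs
    };

    template <typename T>
    class Handle {
    public:
        Handle(Pool<T>& p, std::size_t i) : pool_(&p), index_(i) {}
        T& operator*()  { return pool_->at(index_); }   // resolved on every access
        T* operator->() { return &pool_->at(index_); }
    private:
        Pool<T>* pool_;
        std::size_t index_;
    };

Usage looks like ordinary pointer code: Pool<int> p; Handle<int> h(p, p.alloc(42)); *h = 7; - the price is one extra indirection per dereference.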
Most good C++ libraries support allocators, and you should implement one. You can also overload global new so your version gets used. And keep in mind that you generally won't need to think about a library allocating or deallocating large amounts of data; that is usually the responsibility of client code.

Increase in memory footprint. False alarm or a memory leak?

I have a graphics program where I am creating and destroying the same objects over and over again. All in all, there are 140 objects. They get deleted and newed such that the number never exceeds 140. This is a requirement, as it is a stress test; that is, I cannot have a memory pool or dummy objects. Now I am fairly certain there aren't any memory leaks. I am also using a memory leak detector, which is not reporting any leaks.
The problem is that the memory footprint of the program keeps increasing (albeit quite slowly, slower than the rate at which the objects are being destroyed/created). So my question then is whether an increasing memory footprint is a solid sign for memory leaks or can it sometimes be deceiving?
EDIT: I am using new/delete to create/destroy the objects
It does seem possible that this behavior could come from a situation in which there is no leak.
Is there any chance that your heap is getting fragmented?
Say you make lots of allocations of size n. You free them all, which makes your C library insert those buffers into a free list. Some other code path then makes allocations smaller than n, so those blocks in the free list get chunked up into smaller units. Then the next iteration of the loop does another batch of allocations of size n, and the free list no longer contains contiguous memory at that size, and malloc has to ask the kernel for more memory. Eventually those "smaller-than-n" allocations get freed as would your "n-sized" ones, but if you run enough iterations where the fragmentation exists, I could see the process gradually increasing its memory footprint.
One way to avoid this might be to allocate all your objects once, and not keep allocating/freeing them. Since you're using C++ this might necessitate placement new or something similar. Since you are using Windows, I might also mention that Win32 supports having multiple heaps in a process, so if your objects come from a different heap than other allocations you may avoid this.
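A sketch of the separate-heap idea using the real Win32 heap APIs (the Particle type is just a stand-in for the objects being churned):

    #include <windows.h>
    #include <cstddef>
    #include <new>

    // One private heap just for the churning objects (0, 0, 0 = growable).
    static HANDLE g_objHeap = HeapCreate(0, 0, 0);

    struct Particle {
        float x, y, z;
        static void* operator new(std::size_t n) {
            void* p = HeapAlloc(g_objHeap, 0, n);   // fragmentation stays in here
            if (!p) throw std::bad_alloc();
            return p;
        }
        static void operator delete(void* p) noexcept {
            HeapFree(g_objHeap, 0, p);
        }
    };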
It depends on whether you're under a CLR (or a virtual machine with a garbage collector) or still in native code (like C++, MFC, etc.).
When you have a GC around, you can't really tell unless you test it for long enough: the GC can decide not to clean up your objects for now (there is a way to force a collection).
In native applications, yes, a growing footprint might mean a leak.
There are some tools (very good tools) for C++ that find these leaks (Google DevPartner or BoundsChecker).
I guess there are some tools for C# and Java as well.
If your application's process footprint increases beyond a reasonable limit (which depends on your application and what it does) and continues to increase until you eventually run out of virtual memory, you definitely have a memory leak.
Try the memory allocation tests included in the CRT: http://msdn.microsoft.com/en-us/library/e5ewb1h3%28VS.80%29.aspx
They help A LOT.
But I've noticed that apps do tend to vary their memory consumption a little depending on external factors. Windows 7 might also add extra padding to allocations to work around application bugs: http://msdn.microsoft.com/en-us/library/dd744764%28VS.85%29.aspx
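For reference, a minimal sketch of wiring up the CRT debug-heap leak checks from the link above (MSVC, debug builds only):

    #define _CRTDBG_MAP_ALLOC   // map C allocation calls so reports can show file/line
    #include <stdlib.h>
    #include <crtdbg.h>

    int main() {
        // Ask the CRT to dump leaks automatically at process exit.
        _CrtSetDbgFlag(_CrtSetDbgFlag(_CRTDBG_REPORT_FLAG) | _CRTDBG_LEAK_CHECK_DF);

        char* leaked = (char*)malloc(64);   // deliberately leaked for the demo
        (void)leaked;
        return 0;   // leak report appears in the debugger's Output window
    }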
I strongly suggest trying Visual Studio 2015 (the Community Edition is free). It comes with Diagnostic Tools that help you analyze memory usage; they allow you to take snapshots and view the heap.

Any reason to overload global new and delete?

Unless you're programming parts of an OS or an embedded system, are there any reasons to do so? I can imagine that for some particular classes that are created and destroyed frequently, overloading memory management functions or introducing a pool of objects might lower the overhead, but doing these things globally?
Addition
I've just found a bug in an overloaded delete function - memory wasn't always freed. And that was in a not-so-memory-critical application. Also, disabling these overloads only decreases performance by ~0.5%.
We overload the global new and delete operators where I work for many reasons:
pooling all small allocations -- decreases overhead, decreases fragmentation, can increase performance for small-alloc-heavy apps
framing allocations with a known lifetime -- ignore all the frees until the very end of this period, then free all of them together (admittedly we do this more with local operator overloads than global)
alignment adjustment -- to cacheline boundaries, etc
alloc fill -- helping to expose usage of uninitialized variables
free fill -- helping to expose usage of previously deleted memory
delayed free -- increasing the effectiveness of free fill, occasionally increasing performance
sentinels or fenceposts -- helping to expose buffer overruns, underruns, and the occasional wild pointer
redirecting allocations -- to account for NUMA, special memory areas, or even to keep separate systems separate in memory (for e.g. embedded scripting languages or DSLs)
garbage collection or cleanup -- again useful for those embedded scripting languages
heap verification -- you can walk through the heap data structure every N allocs/frees to make sure everything looks ok
accounting, including leak tracking and usage snapshots/statistics (stacks, allocation ages, etc)
The idea of new/delete accounting is really flexible and powerful: you can, for example, record the entire callstack for the active thread whenever an alloc occurs, and aggregate statistics about that. You could ship the stack info over the network if you don't have space to keep it locally for whatever reason. The types of info you can gather here are only limited by your imagination (and performance, of course).
We use global overloads because it's convenient to hang lots of common debugging functionality there, as well as make sweeping improvements across the entire app, based on the statistics we gather from those same overloads.
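To make the "alloc fill", "free fill" and sentinel items from the list above concrete, here is a deliberately simplified sketch as one global overload (the patterns follow common debug-CRT conventions; alignment handling, new[], and thread safety are omitted):

    #include <cstdlib>
    #include <cstring>
    #include <new>

    static const unsigned char kAllocFill = 0xCD;   // fresh, uninitialized memory
    static const unsigned char kFreeFill  = 0xDD;   // memory that has been freed
    static const unsigned kSentinel = 0xFDFDFDFDu;  // fencepost value

    struct Hdr { std::size_t size; unsigned fence; unsigned pad; };

    void* operator new(std::size_t n) {
        unsigned char* raw = static_cast<unsigned char*>(
            std::malloc(sizeof(Hdr) + n + 4));
        if (!raw) throw std::bad_alloc();
        Hdr h = { n, kSentinel, 0 };
        std::memcpy(raw, &h, sizeof h);               // stash size + front fence
        unsigned char* user = raw + sizeof(Hdr);
        std::memset(user, kAllocFill, n);             // "alloc fill"
        std::memcpy(user + n, &kSentinel, 4);         // back fence
        return user;
    }

    void operator delete(void* p) noexcept {
        if (!p) return;
        unsigned char* user = static_cast<unsigned char*>(p);
        Hdr h;
        std::memcpy(&h, user - sizeof(Hdr), sizeof h);
        unsigned back;
        std::memcpy(&back, user + h.size, 4);
        if (h.fence != kSentinel || back != kSentinel)
            std::abort();                             // overrun/underrun caught
        std::memset(user, kFreeFill, h.size);         // "free fill"
        std::free(user - sizeof(Hdr));
    }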
We still do use custom allocators for individual types too; in many cases the speedup or capabilities you can get by providing custom allocators for e.g. a single point-of-use of an STL data structure far exceeds the general speedup you can get from the global overloads.
Take a look at some of the allocators and debugging systems that are out there for C/C++ and you'll rapidly come up with these and other ideas:
valgrind
electricfence
dmalloc
dlmalloc
Application Verifier
Insure++
BoundsChecker
...and many others... (the gamedev industry is a great place to look)
(One old but seminal book is Writing Solid Code, which discusses many of the reasons you might want to provide custom allocators in C, most of which are still very relevant.)
Obviously if you can use any of these fine tools you will want to do so rather than rolling your own.
There are situations in which it is faster, easier, less of a business/legal hassle, or simply more instructive (or nothing is available for your platform yet): dig in and write a global overload.
The most common reasons to overload new and delete are simply to check for memory leaks and to gather memory usage stats. Note that "memory leak" is usually generalized to memory errors: you can check for things such as double deletes and buffer overruns.
The uses after that are usually memory-allocation schemes, such as garbage collection, and pooling.
All other cases are just specific things, mentioned in other answers (logging to disk, kernel use).
In addition to the other important uses mentioned here, like memory tagging, it's also the only way to force all allocations in your app to go through fixed-block allocation, which has enormous implications for performance and fragmentation.
For example, you may have a series of memory pools with fixed block sizes. Overriding global new lets you direct all 61-byte allocations to, say, the pool with 64-byte blocks, all 768-1024 byte allocs to the 1024-byte-block pool, all those above that to the 2048-byte-block pool, and anything larger than 8 KB to the general ragged heap.
Because fixed-block allocators are much faster and less prone to fragmentation than allocating willy-nilly from the heap, this lets you force even crappy third-party code to allocate from your pools and not poop all over the address space.
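A deliberately simplified sketch of that routing (size classes follow the example above; a hidden header word lets delete route each block back to its pool; alignment handling, slab bookkeeping, and thread safety are omitted):

    #include <cstdlib>
    #include <new>

    // A tiny fixed-block pool: every block is the same size, so it can't fragment.
    class FixedPool {
    public:
        explicit FixedPool(std::size_t block) : block_(block), free_(0) {}
        void* alloc() {
            if (!free_) refill();
            Node* n = free_;
            free_ = n->next;
            return n;
        }
        void release(void* p) {
            Node* n = static_cast<Node*>(p);
            n->next = free_;
            free_ = n;
        }
    private:
        struct Node { Node* next; };
        void refill() {                      // carve one slab into blocks
            const int kChunks = 64;
            char* slab = static_cast<char*>(std::malloc(block_ * kChunks));
            if (!slab) throw std::bad_alloc();
            for (int i = 0; i < kChunks; ++i)
                release(slab + i * block_);
        }
        std::size_t block_;
        Node* free_;
    };

    static FixedPool g_pools[3] = { FixedPool(64), FixedPool(1024), FixedPool(2048) };
    static const std::size_t kLimit[3] = { 64, 1024, 2048 };

    void* operator new(std::size_t n) {
        // One hidden word records which pool a block came from.
        for (int i = 0; i < 3; ++i)
            if (n + sizeof(std::size_t) <= kLimit[i]) {
                std::size_t* p = static_cast<std::size_t*>(g_pools[i].alloc());
                *p = i;
                return p + 1;
            }
        std::size_t* p = static_cast<std::size_t*>(std::malloc(n + sizeof(std::size_t)));
        if (!p) throw std::bad_alloc();
        *p = std::size_t(-1);                // -1 marks "general ragged heap"
        return p + 1;
    }

    void operator delete(void* q) noexcept {
        if (!q) return;
        std::size_t* p = static_cast<std::size_t*>(q) - 1;
        if (*p == std::size_t(-1)) std::free(p);
        else g_pools[*p].release(p);
    }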
This is done often in systems which are time- and space-critical, such as games. 280Z28, Meeh, and Dan Olson have described why.
UnrealEngine3 overloads global new and delete as part of its core memory management system. There are multiple allocators that provide different features (profiling, performance, etc.) and they need all allocations to go through it.
Edit: For my own code, I would only ever do it as a last resort. And by that I mean I would almost positively never use it. But my personal projects are obviously much smaller/very different requirements.
Some realtime systems overload them to prevent their use after initialization.
Overloading new & delete makes it possible to add a tag to your memory allocations. I tag allocations per system or control or by middleware. I can view, at runtime, how much each uses. Maybe I want to see the usage of a parser separated from the UI or how much a piece of middleware is really using!
You can also use it to put guard bands around the allocated memory. If/when your app crashes you can take a look at the address. If you see the contents as "0xABCDABCD" (or whatever you choose as guard) you are accessing memory you don't own.
Perhaps after calling delete you can fill this space with a similarly recognizable pattern.
I believe Visual Studio does something similar in debug builds. Doesn't it fill uninitialized memory with 0xCDCDCDCD?
Finally, if you have fragmentation issues you could use it to redirect allocations to a block allocator, though I am not sure how often this is really a problem.
You need to overload them when the call to new and delete doesn't work in your environment.
For example, in kernel programming, the default new and delete don't work because they rely on a user-mode library to allocate memory.
From a practical standpoint it may just be better to override malloc on a system library level, since operator new will probably be calling it anyway.
On linux, you can put your own version of malloc in place of the system one, as in this example here:
http://developers.sun.com/solaris/articles/lib_interposers.html
In that article, they are trying to collect performance statistics. But you may also detect memory leaks if you also override free.
Since you are doing this in a shared library with LD_PRELOAD, you don't even need to recompile your application.
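A minimal sketch of such an interposer (glibc; dlsym with RTLD_NEXT is the standard way to find the "next" malloc in the link chain):

    // Build:  g++ -shared -fPIC -o libtrace.so trace.cpp -ldl
    // Run:    LD_PRELOAD=./libtrace.so ./your_app
    #include <dlfcn.h>
    #include <cstdio>
    #include <cstddef>

    static void* (*real_malloc)(std::size_t) = 0;

    extern "C" void* malloc(std::size_t size) {
        if (!real_malloc)   // resolve the real malloc once, from the next library
            real_malloc = (void* (*)(std::size_t))dlsym(RTLD_NEXT, "malloc");
        void* p = real_malloc(size);
        // Beware: stdio may itself allocate; real interposers guard against
        // re-entering themselves here.
        std::fprintf(stderr, "malloc(%zu) = %p\n", size, p);
        return p;
    }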
I've seen it done in a system that for 'security'* reasons was required to write over all memory it used on de-allocation. The approach was to allocate an extra few bytes at the start of each block of memory which would contain the size of the overall block which would then be overwritten with zeros on delete.
This had a number of problems as you can probably imagine but it did work (mostly) and saved the team from reviewing every single memory allocation in a reasonably large, existing application.
Certainly not saying that it is a good use but it is probably one of the more imaginative ones out there...
* sadly it wasn't so much about actual security as the appearance of security...
Photoshop plugins written in C++ should override operator new so that they obtain memory via Photoshop.
I've done it with memory mapped files so that data written to the memory is automatically also saved to disk.
It's also used to return memory at a specific physical address if you have memory mapped IO devices, or sometimes if you need to allocate a certain block of contiguous memory.
But 99% of the time it's done as a debugging feature to log how often, where, when memory is being allocated and released.
It's actually pretty common for games to allocate one huge chunk of memory from the system and then provide custom allocators via overloaded new and delete. One big reason is that consoles have a fixed memory size, making both leaks and fragmentation large problems.
Usually (at least on a closed platform) the default heap operations come with a lack of control and a lack of introspection. For many applications this doesn't matter, but for games to run stably in fixed-memory situations the added control and introspection are both extremely important.
It can be a nice trick for your application to be able to respond to low-memory conditions with something other than a random crash. To do this, your new can be a simple proxy to the default new that catches its failures, frees up some stuff, and tries again.
The simplest technique is to reserve a blank block of memory at start-up time for that very purpose. You may also have some cache you can tap into - the idea is the same.
When the first allocation failure kicks in, you still have time to warn your user about the low memory conditions ("I'll be able to survive a little longer, but you may want to save your work and close some other applications"), save your state to disk, switch to survival mode, or whatever else makes sense in your context.
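A sketch of that reserve-block scheme using the standard new-handler hook (the 1 MB reserve size is arbitrary):

    #include <new>

    static char* g_reserve = 0;

    static void OnOutOfMemory() {
        if (g_reserve) {
            delete[] g_reserve;             // release the emergency reserve,
            g_reserve = 0;                  // giving the retry room to succeed
            // ... warn the user, save state, enter survival mode ...
        } else {
            std::set_new_handler(0);        // nothing left: let new throw
        }
    }

    int main() {
        g_reserve = new char[1024 * 1024];  // set aside the blank block up front
        std::set_new_handler(OnOutOfMemory);
        // ... normal operation: on allocation failure the handler runs,
        // then new retries automatically ...
    }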
The most common use case is probably leak checking.
Another use case is when you have specific requirements for memory allocation in your environment which are not satisfied by the standard library you are using, like, for instance, you need to guarantee that memory allocation is lock free in a multi threaded environment.
As many have already stated this is usually done in performance critical applications, or to be able to control memory alignment or track your memory. Games frequently use custom memory managers, especially when targeting specific platforms/consoles.
Here is a pretty good blog post about one way of doing this and some reasoning.
An overloaded new operator also enables programmers to squeeze some extra performance out of their programs. For example, in a class, to speed up the allocation of new nodes, a list of deleted nodes can be maintained so that their memory is reused when new nodes are allocated. In this case, the overloaded delete operator adds nodes to the list of deleted nodes, and the overloaded new operator allocates memory from this list rather than from the heap, speeding up allocation. Heap memory is used only when the list of deleted nodes is empty.
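A minimal sketch of that free-list scheme (not thread-safe, and it assumes no derived classes, so every allocation is exactly one node):

    #include <cstdlib>
    #include <new>

    class Node {
    public:
        int value;
        Node* next;                         // doubles as the free-list link

        static void* operator new(std::size_t n) {
            if (freeList) {                 // reuse a previously deleted node
                Node* p = freeList;
                freeList = p->next;
                return p;
            }
            void* p = std::malloc(n);       // free list empty: go to the heap
            if (!p) throw std::bad_alloc();
            return p;
        }
        static void operator delete(void* p) noexcept {
            Node* n = static_cast<Node*>(p);
            n->next = freeList;             // push onto the free list instead
            freeList = n;                   // of returning memory to the heap
        }
    private:
        static Node* freeList;
    };
    Node* Node::freeList = 0;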

Memory management while loading huge XML files

We have an application which imports objects from an XML file. The XML is around 15 GB. The application invariably starts running out of memory. We tried to free memory in between operations, but this has led to degraded performance, i.e. it takes more time to complete the import operation, and CPU utilization reaches 100%.
The application is written in C++.
Will frequent calls to free() lead to performance issues?
Promoted from a comment by the OP: the parser being used is expat, which is a SAX parser with a very small footprint and customisable memory management.
Use a SAX parser instead of a DOM parser.
Have you tried reusing the memory and your objects instead of freeing and reallocating them? Constant allocation/deallocation cycles, especially when coupled with small (less than 4096-byte) data fragments, can lead to serious performance problems and address-space fragmentation.
Profile the application during one of these bothersome loads, to see where it is spending most of its time.
I believe that free() can sometimes be costly, but that is of course very much dependent on the platform's implementation.
Also, you don't say much about the lifetime of the loaded objects: if the XML is 15 GB, how much of that is kept around for each "object" once the markup is parsed and thrown away?
It sounds sensible to process an input document of this size in a streaming fashion, i.e. not trying a DOM-approach which loads and builds the entire XML parse tree at once.
If you want to minimise your memory usage, take a look at How to read the XML data from a file by using Visual C++.
One thing that often helps is to use a lightweight low-overhead memory pool. If you combine this with "frame" allocation methods (ignoring any delete/free until you're all done with the data), you can get something that's ridiculously fast.
We did this for an embedded system recently, mostly for performance reasons, but it saved a lot of memory as well.
The trick was basically to allocate a big block -- slightly bigger than we'd need (you could allocate a chain of blocks if you like) -- and just keep returning a "current" pointer (bumping it up by allocSize, rounded up to maximum align requirement of 4 in our case, each time). This cut our overhead per alloc from on the order of 52-60 bytes down to <= 3 bytes. We also ignored "free" calls until we were all done parsing and then freed the whole block.
If you're clever enough with your frame allocation you can save a lot of space and time. It might not get you all the way to your 15GiB, but it would be worth looking at how much space overhead you really have... My experience with DOM-based systems is that they use tons of small allocs, each with a relatively high overhead.
(If you have virtual memory, a large "block" might not even hurt that much, if your access at any given time is local to a page or three anyway...)
Obviously you have to keep the memory you actually need in the long run, but the parser's "scratch memory" becomes a lot more efficient this way.
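A bare-bones sketch of such a frame arena (the numbers follow the text: bump by the request size rounded up to a 4-byte alignment, ignore individual frees, one free() at the end):

    #include <cstdlib>

    class FrameArena {
    public:
        explicit FrameArena(std::size_t bytes)
            : base_(static_cast<char*>(std::malloc(bytes))),
              cur_(base_), end_(base_ + bytes) {}
        ~FrameArena() { std::free(base_); }      // the whole frame at once

        void* alloc(std::size_t n) {
            n = (n + 3) & ~std::size_t(3);       // round up to alignment of 4
            if (cur_ + n > end_) return 0;       // real code would chain a new block
            void* p = cur_;
            cur_ += n;                           // per-alloc overhead: ~zero
            return p;
        }
        // Note: freeing individual allocations is deliberately a no-op here.

    private:
        char* base_;
        char* cur_;
        char* end_;
    };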
We tried to free memory in between operations but this has led to degraded performance.
Will frequent calls to free() lead to performance issues?
Based on the evidence supplied, yes.
Since you're already using expat, a SAX parser, what exactly are you freeing? If you can free it, why are you mallocing it in a loop in the first place?
A profiler should tell you where the time goes.
Also don't forget that heap access is effectively single-threaded: if both of your threads allocate/free memory at the same time, one of them will have to wait until the other is done.
If you are allocating and freeing memory for the same objects, you could create a pool of those objects and do the allocation/free only once.
Try to find a way to profile your code.
If you have no profiler, try to organize your work so that you only make a few free() calls (instead of the many you suggest).
It is very common to have to find the right balance between memory consumption and time efficiency.
I have not tried it myself, but have you heard of XMLLite? There's an MSDN article introducing it. It's used by MS Office internally.

How to solve Memory Fragmentation

We've occasionally been getting problems whereby our long-running server processes (running on Windows Server 2003) have thrown an exception due to a memory allocation failure. Our suspicion is these allocations are failing due to memory fragmentation.
Therefore, we've been looking at some alternative memory allocation mechanisms that may help us and I'm hoping someone can tell me the best one:
1) Use Windows Low-fragmentation Heap
2) jemalloc - as used in Firefox 3
3) Doug Lea's malloc
Our server process is developed using cross-platform C++ code, so any solution would be ideally cross-platform also (do *nix operating systems suffer from this type of memory fragmentation?).
Also, am I right in thinking that the LFH is now the default memory allocation mechanism for Windows Server 2008 / Vista? Will my current problems "go away" if our customers simply upgrade their server OS?
First, I agree with the other posters who suggested a resource leak. You really want to rule that out first.
Hopefully, the heap manager you are currently using has a way to dump out the actual total free space available in the heap (across all free blocks) and also the total number of blocks that it is divided over. If the average free block size is relatively small compared to the total free space in the heap, then you do have a fragmentation problem. Alternatively, if you can dump the size of the largest free block and compare that to the total free space, that will accomplish the same thing. The largest free block would be small relative to the total free space available across all blocks if you are running into fragmentation.
To be very clear about the above, in all cases we are talking about free blocks in the heap, not the allocated blocks in the heap. In any case, if the above conditions are not met, then you do have a leak situation of some sort.
So, once you have ruled out a leak, you could consider using a better allocator. Doug Lea's malloc, suggested in the question, is a very good allocator for general-use applications and very robust most of the time. Put another way, it has been time-tested to work very well for almost any application. However, no algorithm is ideal for all applications, and any management algorithm can be broken by the right pathological conditions against its design.
Why are you having a fragmentation problem? Fragmentation problems are caused by the behavior of an application and have to do with greatly different allocation lifetimes in the same memory arena. That is, some objects are allocated and freed regularly while other types of objects persist for extended periods of time, all in the same heap... think of the longer-lifetime ones as poking holes into larger areas of the arena and thereby preventing the coalescing of adjacent blocks that have been freed.
To address this type of problem, the best thing you can do is logically divide the heap into sub arenas where the lifetimes are more similar. In effect, you want a transient heap and a persistent heap or heaps that group things of similar lifetimes.
Some others have suggested another approach to solve the problem which is to attempt to make the allocation sizes more similar or identical, but this is less ideal because it creates a different type of fragmentation called internal fragmentation - which is in effect the wasted space you have by allocating more memory in the block than you need.
Additionally, with a good heap allocator like Doug Lea's, making the block sizes more similar is unnecessary, because the allocator will already be doing a power-of-two size-bucketing scheme that makes it completely unnecessary to artificially adjust the allocation sizes passed to malloc() - in effect, his heap manager does that for you automatically, and much more robustly than the application could.
I think you’ve mistakenly ruled out a memory leak too early.
Even a tiny memory leak can cause a severe memory fragmentation.
Assuming your application behaves like the following:
Allocate 10MB
Allocate 1 byte
Free 10MB
(oops, we didn’t free the 1 byte, but who cares about 1 tiny byte)
This seems like a very small leak, you will hardly notice it when monitoring just the total allocated memory size.
But this leak eventually will cause your application memory to look like this:
...
[Free - 10MB]
[Allocated - 1 byte]
[Free - 10MB]
[Allocated - 1 byte]
[Free - 10MB]
...
This leak will not be noticed... until you want to allocate 11MB
Assuming your minidumps had full memory info included, I recommend using DebugDiag to spot possible leaks.
In the generated memory report, examine carefully the allocation count (not size).
As you suggest, Doug Lea's malloc might work well. It's cross platform and it has been used in shipping code. At the very least, it should be easy to integrate into your code for testing.
Having worked in fixed-memory environments for a number of years, this situation is certainly a problem, even in non-fixed environments. We have found that the CRT allocators tend to stink pretty badly in terms of performance (speed, wasted space, etc.). I firmly believe that if you have an extensive need for a good memory allocator over a long period of time, you should write your own (or see if something like dlmalloc will work). The trick is getting something written that works with your allocation patterns, and that has more to do with memory management efficiency than almost anything else.
Give dlmalloc a try. I definitely give it a thumbs up. It's fairly tunable as well, so you might be able to get more efficiency by changing some of the compile time options.
Honestly, you shouldn't depend on things "going away" with new OS implementations. A service pack, patch, or another new OS N years later might make the problem worse. Again, for applications that demand a robust memory manager, don't use the stock versions that are available with your compiler. Find one that works for your situation. Start with dlmalloc and tune it to see if you can get the behavior that works best for your situation.
You can help reduce fragmentation by reducing the amount you allocate and deallocate.
E.g., for a web server running a server-side script, it may create a string to output the page to. Instead of allocating and deallocating these strings for every page request, just maintain a pool of them, so you're only allocating when you need more, and you're not deallocating (meaning that after a while you reach a state where you aren't allocating anymore either, because you have enough).
You can use _CrtDumpMemoryLeaks(); to dump memory leaks to the debug window when running a debug build; however, I believe this is specific to the Visual C++ compiler. (It's in crtdbg.h.)
I'd suspect a leak before suspecting fragmentation.
For the memory-intensive data structures, you could switch over to a re-usable storage pool mechanism. You might also be able to allocate more stuff on the stack as opposed to the heap, but in practical terms that won't make a huge difference I think.
I'd fire up a tool like valgrind or do some intensive logging to look for resources not being released.
#nsaners - I'm pretty sure the problem is down to memory fragmentation. We've analyzed minidumps that point to a problem when a large (5-10mb) chunk of memory is being allocated. We've also monitored the process (on-site and in development) to check for memory leaks - none were detected (the memory footprint is generally quite low).
The problem does happen on Unix, although it's usually not as bad.
The low-fragmentation heap helped us, but my co-workers swear by Smart Heap
(it's been used cross-platform in a couple of our products for years). Unfortunately, due to other circumstances, we couldn't use Smart Heap this time.
We also look at block/chunk allocation and try to have scope-savvy pools/strategies, i.e.,
long-term things here, whole-request things there, short-term things over there, etc.
As usual, you can waste memory to gain some speed.
This technique isn't useful for a general-purpose allocator, but it does have its place.
Basically, the idea is to write an allocator that returns memory from a pool where all the allocations are the same size. This pool can never become fragmented because any block is as good as another. You can reduce memory wastage by creating multiple pools with different size chunks and pick the smallest chunk size pool that's still greater than the requested amount. I've used this idea to create allocators that run in O(1).
If you're talking about Win32, you can try to squeeze something out by using the LARGEADDRESSAWARE flag. You'll have ~1 GB of extra address space, so it will take your application longer to fragment it.
The simple, quick-and-dirty solution is to split the application into several processes; you get a fresh heap each time you create a process.
Your memory and speed might suffer a bit (from swapping), but fast hardware and plenty of RAM should help.
This is an old UNIX trick from the daemon days, before threads existed.