How to debug heap corruption errors?

How to debug heap corruption errors? - c++

I am debugging a (native) multi-threaded C++ application under Visual Studio 2008. On seemingly random occasions, I get a "Windows has triggered a break point..." error with a note that this might be due to a corruption in the heap. These errors won't always crash the application right away, although it is likely to crash short after.
The big problem with these errors is that they pop up only after the corruption has actually taken place, which makes them very hard to track and debug, especially on a multi-threaded application.
What sort of things can cause these errors?
How do I debug them?
Tips, tools, methods, enlightments... are welcome.

Application Verifier combined with Debugging Tools for Windows is an amazing setup. You can get both as a part of the Windows Driver Kit or the lighter Windows SDK. (Found out about Application Verifier when researching an earlier question about a heap corruption issue.) I've used BoundsChecker and Insure++ (mentioned in other answers) in the past too, although I was surprised how much functionality was in Application Verifier.
Electric Fence (aka "efence"), dmalloc, valgrind, and so forth are all worth mentioning, but most of these are much easier to get running under *nix than Windows. Valgrind is ridiculously flexible: I've debugged large server software with many heap issues using it.
When all else fails, you can provide your own global operator new/delete and malloc/calloc/realloc overloads -- how to do so will vary a bit depending on compiler and platform -- and this will be a bit of an investment -- but it may pay off over the long run. The desirable feature list should look familiar from dmalloc and electricfence, and the surprisingly excellent book Writing Solid Code:
sentry values: allow a little more space before and after each alloc, respecting maximum alignment requirement; fill with magic numbers (helps catch buffer overflows and underflows, and the occasional "wild" pointer)
alloc fill: fill new allocations with a magic non-0 value -- Visual C++ will already do this for you in Debug builds (helps catch use of uninitialized vars)
free fill: fill in freed memory with a magic non-0 value, designed to trigger a segfault if it's dereferenced in most cases (helps catch dangling pointers)
delayed free: don't return freed memory to the heap for a while, keep it free filled but not available (helps catch more dangling pointers, catches proximate double-frees)
tracking: being able to record where an allocation was made can sometimes be useful
Note that in our local homebrew system (for an embedded target) we keep the tracking separate from most of the other stuff, because the run-time overhead is much higher.
If you're interested in more reasons to overload these allocation functions/operators, take a look at my answer to "Any reason to overload global operator new and delete?"; shameless self-promotion aside, it lists other techniques that are helpful in tracking heap corruption errors, as well as other applicable tools.
Because I keep finding my own answer here when searching for alloc/free/fence values MS uses, here's another answer that covers Microsoft dbgheap fill values.

You can detect a lot of heap corruption problems by enabling Page Heap for your application . To do this you need to use gflags.exe that comes as a part of Debugging Tools For Windows
Run Gflags.exe and in the Image file options for your executable, check "Enable Page Heap" option.
Now restart your exe and attach to a debugger. With Page Heap enabled, the application will break into debugger whenever any heap corruption occurs.

To really slow things down and perform a lot of runtime checking, try adding the following at the top of your main() or equivalent in Microsoft Visual Studio C++
_CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF | _CRTDBG_CHECK_ALWAYS_DF );

A very relevant article is Debugging Heap corruption with Application Verifier and Debugdiag.

What sort of things can cause these errors?
Doing naughty things with memory, e.g. writing after the end of a buffer, or writing to a buffer after it's been freed back to the heap.
How do I debug them?
Use an instrument which adds automated bounds-checking to your executable: i.e. valgrind on Unix, or a tool like BoundsChecker (Wikipedia suggests also Purify and Insure++) on Windows.
Beware that these will slow your application, so they may be unusable if yours is a soft-real-time application.
Another possible debugging aid/tool might be MicroQuill's HeapAgent.

One quick tip, that I got from Detecting access to freed memory is this:
If you want to locate the error
quickly, without checking every
statement that accesses the memory
block, you can set the memory pointer
to an invalid value after freeing the
block:
#ifdef _DEBUG // detect the access to freed memory
#undef free
#define free(p) _free_dbg(p, _NORMAL_BLOCK); *(int*)&p = 0x666;
#endif

The best tool I found useful and worked every time is code review (with good code reviewers).
Other than code review, I'd first try Page Heap. Page Heap takes a few seconds to set up and with luck it might pinpoint your problem.
If no luck with Page Heap, download Debugging Tools for Windows from Microsoft and learn to use the WinDbg. Sorry couldn't give you more specific help, but debuging multi-threaded heap corruption is more an art than science. Google for "WinDbg heap corruption" and you should find many articles on the subject.

What type of allocation functions are you using? I recently hit a similar error using the Heap* style allocation functions.
It turned out that I was mistakenly creating the heap with the HEAP_NO_SERIALIZE option. This essentially makes the Heap functions run without thread safety. It's a performance improvement if used properly but shouldn't ever be used if you are using HeapAlloc in a multi-threaded program [1]. I only mention this because your post mentions you have a multi-threaded app. If you are using HEAP_NO_SERIALIZE anywhere, delete that and it will likely fix your problem.
[1] There are certain situations where this is legal, but it requires you to serialize calls to Heap* and is typically not the case for multi-threaded programs.

If these errors occur randomly, there is high probability that you encountered data-races. Please, check: do you modify shared memory pointers from different threads? Intel Thread Checker may help to detect such issues in multithreaded program.

You may also want to check to see whether you're linking against the dynamic or static C runtime library. If your DLL files are linking against the static C runtime library, then the DLL files have separate heaps.
Hence, if you were to create an object in one DLL and try to free it in another DLL, you would get the same message you're seeing above. This problem is referenced in another Stack Overflow question, Freeing memory allocated in a different DLL.

In addition to looking for tools, consider looking for a likely culprit. Is there any component you're using, perhaps not written by you, which may not have been designed and tested to run in a multithreaded environment? Or simply one which you do not know has run in such an environment.
The last time it happened to me, it was a native package which had been successfully used from batch jobs for years. But it was the first time at this company that it had been used from a .NET web service (which is multithreaded). That was it - they had lied about the code being thread safe.

You can use VC CRT Heap-Check macros for _CrtSetDbgFlag: _CRTDBG_CHECK_ALWAYS_DF or _CRTDBG_CHECK_EVERY_16_DF.._CRTDBG_CHECK_EVERY_1024_DF.

I'd like to add my experience. In the last few days, I solved an instance of this error in my application. In my particular case, the errors in the code were:
Removing elements from an STL collection while iterating over it (I believe there are debug flags in Visual Studio to catch these things; I caught it during code review)
This one is more complex, I'll divide it in steps:
From a native C++ thread, call back into managed code
In managed land, call Control.Invoke and dispose a managed object which wraps the native object to which the callback belongs.
Since the object is still alive inside the native thread (it will remain blocked in the callback call until Control.Invoke ends). I should clarify that I use boost::thread, so I use a member function as the thread function.
Solution: Use Control.BeginInvoke (my GUI is made with Winforms) instead so that the native thread can end before the object is destroyed (the callback's purpose is precisely notifying that the thread ended and the object can be destroyed).

I had a similar problem - and it popped up quite randomly. Perhaps something was corrupt in the build files, but I ended up fixing it by cleaning the project first then rebuilding.
So in addition to the other responses given:
What sort of things can cause these errors?
Something corrupt in the build file.
How do I debug them?
Cleaning the project and rebuilding. If it's fixed, this was likely the problem.

I have also faced this issue. In my case, I allocated for x size memory and appended the data for x+n size. So, when freeing it shown heap overflow. Just make sure your allocated memory sufficient and check for how many bytes added in the memory.

Related

Freeing memory very slow with Visual debugger

In our applications we are using a class that utilizes a lot of big Qt containers. If objects of that class are destroyed while a Visual Studio debugger is attached to the process, freeing of the memory is extremly slow (multiple minutes). Still it works fine but simply very slow.
Some time ago I already could confirm that the debugger's memory checking is reponsible for this. It is a known issue.
I worked around this issue by simply stopping the application by debugger if it is stuck in freeing memory or by starting it without an attached debugger.
However now I need to debug code that periodically deallocates such objects. Of course it works but it is inacceptable slow and since I need to do a lot of cycles I need a better solution.
Is there any way to disable heap memory checking in VS2013 debugger? Or is there a way to exclude some variables from those checks?

I tried another solution. Setting the environment variable _NO_DEBUG_HEAP to 1 actually solves my problem.
This can be done on system level (using a common environment variable editor e.g. REE) or on application level by setting the Visual Studio project property Debugging/Environment to "_NO_DEBUG_HEAP=1".
I found this information in an answer to my question that unfortunately isn't visible anymore (I suppose it was deleted by moderators). It contained this link that was actually very helpful.

_CrtSetDbgFlag controls what kinds of checking the debug heap does.
It's possible your code (or perhaps a library you're using) has cranked up the checking level. For example, you can have it check the integrity of the heap on every allocation and deallocation. That can cause a huge slowdown. Don't use _CRTDBG_CHECK_ALWAYS_DF unless you actually need it to find a specific memory problem.
For basic leak checking, which generally isn't a huge performance penalty, you need only (_CRTDBG_LEAK_CHECK_DF | _CRTDBG_ALLOC_MEM_DF).

Out of virtual memory address space (Borland C++ Builder 6 program)

I have problem with some application written under C++ Builder 6. After some time of running (week, month) the application crashes and closes without any error message. In my application log shortly before crash i get many "Out of memory" exceptions.
I looked at the process when it was throwing out of memory exceptions (screenshot below) and it has lots of uncommitted private memory space. What can be a reason of such behavior?
I had such problem once, couple years ago. The reason for that was an option "use dynamic libraries" unchecked in linker options. When I checked it back the problem disappeared and vice versa. The test application which I made was just calling "new char[1000000]" and then delete. The memory was freed every time (no committed memory rise in windows task manager), but after some time I got out of memory, VMMap showed exactly the same thing. Lots of reserved private memory but most of it uncommitted.
Now the problem returned but I can't fix it the same way. I don't know if that was the reason but I had Builder 6 and 2010 istalled on the same machine. Now I just have Builder 6 and it seems that I cannot reproduce the error with test application like before. Ether way it seems that there is some memory manager error or something. CodeGuard doesn't show any memory leaks. When I create memory block with "new" it instantly shows in "memory commit size" and when delete the memory usage decreases, so I assume that the memory leaks are not the case, task manager doesn't show much "memory commit size".
Is there anything I can do? Is there any way I can release uncommitted memory? How to diagnose the problem any further?
The screenshot:
http://i.stack.imgur.com/UKuTZ.jpg

I found a way to simulate this problem and solution.
for(int i=0; i<100; i++)
{
char * b = new char[100000000];
new char;
delete b;
}
Borland memory manager reserves a block of memory which size is multiple of one page which is 4kB. When allocating memory size different than multiple of 4kB there is some free space which borland may use to allocate some other memory chunk. When the first chunk is deallocated the second is still keeping hole memory block reserved.
At first look the code should cause just 100B memory leak, but in fact it will cause memory allocation exception after less than 16 iterations.
I have found two solutions for this problem. One is FastMM, it works but also brings some troubles with it too.
Second solution is to exchange borlndmm.dll with the one from Embarcadero Rad Studio 2010. I didn't test it thoroughly yet but it seems to work without any problem.
I should move the hole project to RAD 2010 but for some reasons I got stuck in Borland 6.

Prologue
Hmm interesting behaviour... I have to add some thing I learned the hard way. I dismiss BCB6 immediately after try it for few times because it had too much bugs for my taste (in comparison with BCB5 especially with AnsiStrings handling). So I stayed with BCB5 for a long time without any problems. I used it even for a very Big projects like CAD/CAM.
After few years pass I had to move to BDS2006 because of my employer and the problems start (some can be similar to yours). Aside the minor IDE and trace/breakpoint/codeguard bugs there are more important things like:
memory manager
delete/delete[] corrupts memory manager if called twice for the same pointer without throwing any exception to notify ...
wrong default constructor/destructor for struct compiler bug was the biggest problem I had (in combination with the delete)
wrong or missing member functions in classes can cause multiple delete call !!! due t bug either in compiler or in C++ engine.
but I was lucky and solve it here: bds 2006 C hidden memory manager conflicts (class new / delete[] vs. AnsiString)
wrong compile
Sometimes app is compiled wrongly, no error is thrown but some lines of code are missing in exe and/or are in different order then in source code. I saw this occasionally also in BCB 5,6. To solve that:
delete all temp files like ~,obj,tds,map,exe,...
close IDE and open it again just to be sure (sometimes view of local variables (mostly big arrays) corrupt IDE memory)
compile again
beware breakpoint/trace/codeguard behave differently then raw app
especially with multi-threading App behaves different while traced and while not. Also codeguard do a big difference (and I do not mean the slow down of execution which corrupts sensitive timing). For example codeguard has a nasty habit of throwing out of memory exceptions without a reason sometimes so some parts of code must be checked over and over until it goes through sometimes even if the mem usage is still the same and far from out of mem.
AnsiString operators
There are two types of AnsiString in VCL normal and component property. So it is wise to take that into account because for component property AnsiString's are the operation of operators different. Try for example something like
Edit1->Text+="xxx";
also there are still AnsiString operator bugs like this:
AnsiString version="aaa"+AnsiString("aaa")+"aaa"; // codeguard: array access violation
Import older BCB projects
Avoid direct import if possible it often creates some unknown allocation and memleaks errors. I am not sure why but I suspect that imported window classes are handled differently and the memleaks are related to bullet #1. Better way is create new App and create/copy components and code manually. I know it is backword but the only safe way to avoid problems still don't know where is the problem but simple *.bdsproj replacement will not help !!! And in *.dfm I had not seen anything suspicious.

How do I replace global operator new and delete in an MFC application (debug only)

I've avoided trying to do anything with operator new for many years, due to my sense that it is a quagmire on Windows (especially using MFC). Not to mention, unless one has a very compelling reason to mess with global (or even class) new and delete, one should not.
However, I have a nasty little memory corruption bug, and I'd very much like to track it down. I get messages from the CRT debug allocator indicating that previously freed memory was overwritten. This message is displayed only during a later allocation call, when it tries to reuse a block (I believe this is how it works, anyway).
Due to the portion of code in question, the error message and point of corruption are very unrelated. All I know is "something somewhere overwrote some memory that was previously freed with a single null byte." (I ascertained this by using the debugger and watching the memory referred to by the debug heap over several different runs).
Having exhausted the obvious ideas as to where the culprit might be, I'm left with trying to do something more rigorous. It occurred to me that it would be ideal if I could cause each freed block to become a no-access page of memory so that the writer would be immediately caught by the CPU's MMC! A little bit of searching later, and I found someone who had implemented something along those lines:
http://www.codeproject.com/Articles/38340/Immediate-memory-corruption-detection
His code is buried under a ton of reinvent-the-wheel code, but extracting the core concept was pretty easy, and I have done so.
The problem I have now is that MFC redfines new as DEBUG_NEW, and then further defines a slew of debug interfaces down to the CRT. In addition, it does define the global operator new and delete. Hence, as far as C++ is concerned, the "user" is trying to replace global operator new and delete twice, and hence I get a linker error to the effect 'symbol already defined.'
Looking around the internet, and SO, I see a few promising articles, but none which ultimately have anything positive to say about replacing global operator new/delete for MFC.
How to properly replace global new & delete operators
Is it possible to replace the memory allocator in a debug build of an MFC application?
I am already aware:
MFC/CRT already provides rich debugging tools for memory allocation.
Well, it provides what it provides - such as the message that got me rolling down this path in the first place. I now know that a corruption is occurring, but that's awfully weak-sauce!
What I would like to supply is guarded-allocation (or even just guarded deallocation). This is clearly possible by using a lot of virtual address space and placing every allocation in isolation, which is horribly wasteful of memory. Okay, yeah, can't see the down side when this is debug-only code useful for special-purpose moments like now.
So, I'm desperately seeking solutions to the following
Force the compiler to be copacetic with my global operator new/delete despite the CRT/MFC supplied one.
Find another way to hook the MFC/CRT _heap_alloc_dbg chain to bottom out at using my own code in place of theirs, for the pen-ultimate allocation (i.e. I'll allocate via the OS's VirtualAlloc/VirtualFree to supply memory for new and/or malloc).
Does anyone know of answers, or good articles to read that might shed some light on how that these may be accomplished?
Other ideas:
Replace the CRT's new/delete at runtime using a thunk technique.
Some other approach entirely?!
Further investigation:
This article is pretty cool... it gives a way for me to patch the global new/delete operators at runtime. However, as the article points out, it's a bit hackish (however, since I only need this for debug builds, that's not a big deal) http://zeuxcg.blogspot.com/2009/03/fighting-against-crt-heap-and-winning.html
So although this is getting at what I want (a mechanism to replace the CRT memory allocation functions), this implementation is pretty far out of date, and so far my attempts to make it work have run into myriad issues. I think it's just too hacked to the version it was originally created for, and only for a relatively simple console use (i.e. C, not even C++, and jettisoning most of the debugging features provided by the microsoft CRT). Hence, although a super-cool idea, one that ultimately would cost many hours of effort to make work with the current VS2010 dev studio, and hence not worth it (to me).
Apparently there is a well-known version of this idea: http://en.wikipedia.org/wiki/Electric_Fence Which unfortunately even the Windows port I found http://code.google.com/p/electric-fence-win32/ fails to override the CRT properly, but asks that you modify all of your source code to access the electric fence heap allocation code. :(
Update 5/3/2012:
And now I discover that Windows already provides an implementation of Electric Fence, accessible via GFLAGS debugging tool http://support.microsoft.com/kb/286470 This can be turned on and off external to the application being tested. It's essentially the same technology as I was interested in, and has the features in the DUMA project (a branch of the Electric Fence - http://duma.sourceforge.net/

The MSVCRT debug heap is actually pretty good and has some useful features you can use, such as breakpoint on the nth allocation etc.
http://msdn.microsoft.com/en-us/library/974tc9t1(v=VS.80).aspx
Among other things you can insert an allocation hook which outputs debugging information etc which you can use to debug this sort of issue.
http://msdn.microsoft.com/en-us/library/z2zscsc2(v=vs.80).aspx
In your case all you really need to do is output the address, and file and line of each allocation. Then when you experience a corrupt block, find the block whose address immediately precedes it, which will almost certainly be the one which overran. You can use the memory view in the Visual Studio debugger to look at the memory address which was corrupted and see the preceding blocks. That should tell you all you need to know to find out when it was allocated.
The debug heap also has a numerical allocation ID on each block allocated, and can break on the nth! allocation, so if you can get a reasonably consistent repro, so the same numerical block is corrupted each time, then you should be able to use the "break on nth" functionality to get a full call stack for the time it was allocated.
You may also find _CrtCheckMemory useful to find out if corruption has occurred much earlier. Just call it periodically, and once you have the bug bracketed (error didn't occur in one, did occur in the other) move them closer and closer together.

Why do certain things never crash whith debugger on?

My application uses GLUTesselator to tesselate complex concave polygons. It randomly crashes when I run the plain release exe, but it never crashes if I do start debugging in VS. I found this right here which is basically my problem:
The multi-thread debug CRT (/MTd) masks the problem, because, like
Windows does with processes spawned by
a debugger, it provides to your
program a debug heap, that is
initialized to the 0xCD pattern.
Probably somewhere you use some
uninitialized area of memory from the
heap as a pointer and you dereference
it; with the two debug heaps you get
away with it for some reason (maybe
because at address 0xbaadf00d and
0xcdcdcdcd there's valid allocated
memory), but with the "normal" heap
(which is often initialized to 0) you
get an access violation, because you
dereference a NULL pointer.
The problem is the crash occurs in GLU32.dll and I have no way to find out why its trying to dereference a null pointer sometimes. it seems to do this when my polygons get fairly large and have lots of points. What can I do?
Thanks

It's a fact of life that sometimes programs behave differently in the debugger. In your case, some memory is initialized differently, and it's probably laid out differently as well. Another common case in concurrent programs is that the timing is different, and race conditions often happen less often in a debugger.
You could try to manually initialize the heap to a different value (or see if there is an option for this in Visual Studio). Usually initializing to nonzero catches more bugs, but that may not be the case in your situation. You could also try to play with your program's memory mapping to arrange that the page 0xcdcdc000 is unmapped.
Visual Studio can set a breakpoint on accesses to a particular memory address, you could try this (it may slow your program significantly more than a variable breakpoint).

but it never crashes if I do start debugging in VS.
Well, I'm not sure exactly why but while debugging in visual studio program sometimes can get away with accessing some memory regions that would crash it without debugger. I do not know exact reasons, though, but sometimes 0xcdcdcdcd and 0xbaadfood doesn't have anything to do with that. It is just accessing certain addresses doesn't cause problems. When this happens, you'll need to find alternative methods of guessing the problem.
What can I do?
Possible solutions:
Install exception handler in your program (_set_se_translator, if I remember correctly). On access violation try MinidumpWriteDump. Debug it later using Visual Studio (afaik, crash dump debugging is n/a in express edition), or using windbg.
Use just-in-time debuggers. Non-express edition of visual studio have this feature. There are probably alternatives.
Write custom memory manager (that'll override new/delete and will provide malloc/free alternatives (if you use them)) that will grab large chunk of memory, lock all unused memory with VirtualProtect. In this case all invalid access will cause crashes even in debug mode. You'll need a lot of memory for such memory manager, because to be locked, each block should be aligned to pages.
Add excessive logging to all suspicious function calls. Dump a lot of text/debug information into file (or stderr) - parameter values, arrays, everything you suspect could be related to crash, flush after every write to file, otherwise some info will be lost during the crash. This way you'll be able to guess what happened before program crashed.
Try debugging release build. You should be able to do it to some extent if you enable "debug information" for release build in project settings.
Try switching on/off "basic runtime checks" and "buffer security check" in project properties (configuration properties->c/c++->code genration).
Try to find some kind of external tool - something like valgrind or bounds checker. Although, to my expereinece, #3 is more reliable than that approach. Although that really depends on the problem.

A link to an earlier question and two thoughts.
First off you may want to look at a previous question about valgrind substitutes for windows. Lots of good hints on programs that will help you.
Now the thoughts:
1) The debugger may stop your program from crashing in the code you're testing, but it's not fixing the problem. At worst you're just kicking the can down the street, there's still corruption but it's not evident from the way you're running. When you ship you can be assured someone will run into the problem again.
2) What often happens in cases like this is that the error isn't near where the problem occurs. While you may be noticing the problem in GLU32.dll, there was probably corruption earlier, maybe even in a different thread or function, which didn't cause a problem and at some later point the program came back to the corrupted region and failed.

Heap corruption under Win32; how to locate?

I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools to locate this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced - there are a lot of code changes in the repository.
The prompt for crashing behaviuor is to generate throughput in this system - socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to exception (various places, various causes - including heap alloc failing, thus: heap corruption).
The behaviour seems related to CPU power or memory bandwidth; the more of each the machine has, the easier it is to crash. Disabling a hyper-threading core or a dual-core core reduces the rate of (but does not eliminate) corruption. This suggests a timing related issue.
Now here's the rub:
When it's run under a lightweight debug environment (say Visual Studio 98 / AKA MSVC6) the heap corruption is reasonably easy to reproduce - ten or fifteen minutes pass before something fails horrendously and exceptions, like an alloc; when running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9 or even Microsoft Application Verifier) the system becomes memory-speed bound and doesn't crash (Memory-bound: CPU is not getting above 50%, disk light is not on, the program's going as fast it can, box consuming 1.3G of 2G of RAM). So, I've got a choice between being able to reproduce the problem (but not identify the cause) or being able to idenify the cause or a problem I can't reproduce.
My current best guesses as to where to next is:
Get an insanely grunty box (to replace the current dev box: 2Gb RAM in an E6550 Core2 Duo); this will make it possible to repro the crash causing mis-behaviour when running under a powerful debug environment; or
Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad-guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.
And, no: Shipping with Purify instrumentation built in is not an option.
A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"
And now, the question: How do I locate the heap corruptor?
Update: balancing new[] and delete[] seems to have gotten a long way towards solving the problem. Instead of 15mins, the app now goes about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.
Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.
Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
I'll take a note of that, but I'm concerned that Dr Watson will only be tripped up after the fact, not when the heap is getting stomped on.
Another try might be using WinDebug as a debugging tool which is quite powerful being at the same time also lightweight.
Got that going at the moment, again: not much help until something goes wrong. I want to catch the vandal in the act.
Maybe these tools will allow you at least to narrow the problem to certain component.
I don't hold much hope, but desperate times call for...
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?
No I'm not, and I'll spend a couple of hours tomorrow going through the workspace (58 projects in it) and checking they're all compiling and linking with the appropriate flags.
Update: This took 30 seconds. Select all projects in the Settings dialog, unselect until you find the project(s) that don't have the right settings (they all had the right settings).

My first choice would be a dedicated heap tool such as pageheap.exe.
Rewriting new and delete might be useful, but that doesn't catch the allocs committed by lower-level code. If this is what you want, better to Detour the low-level alloc APIs using Microsoft Detours.
Also sanity checks such as: verify your run-time libraries match (release vs. debug, multi-threaded vs. single-threaded, dll vs. static lib), look for bad deletes (eg, delete where delete [] should have been used), make sure you're not mixing and matching your allocs.
Also try selectively turning off threads and see when/if the problem goes away.
What does the call stack etc look like at the time of the first exception?

I have same problems in my work (we also use VC6 sometimes). And there is no easy solution for it. I have only some hints:
Try with automatic crash dumps on production machine (see Process Dumper). My experience says Dr. Watson is not perfect for dumping.
Remove all catch(...) from your code. They often hide serious memory exceptions.
Check Advanced Windows Debugging - there are lots of great tips for problems like yours. I recomend this with all my heart.
If you use STL try STLPort and checked builds. Invalid iterator are hell.
Good luck. Problems like yours take us months to solve. Be ready for this...

We've had pretty good luck by writing our own malloc and free functions. In production, they just call the standard malloc and free, but in debug, they can do whatever you want. We also have a simple base class that does nothing but override the new and delete operators to use these functions, then any class you write can simply inherit from that class. If you have a ton of code, it may be a big job to replace calls to malloc and free to the new malloc and free (don't forget realloc!), but in the long run it's very helpful.
In Steve Maguire's book Writing Solid Code (highly recommended), there are examples of debug stuff that you can do in these routines, like:
Keep track of allocations to find leaks
Allocate more memory than necessary and put markers at the beginning and end of memory -- during the free routine, you can ensure these markers are still there
memset the memory with a marker on allocation (to find usage of uninitialized memory) and on free (to find usage of free'd memory)
Another good idea is to never use things like strcpy, strcat, or sprintf -- always use strncpy, strncat, and snprintf. We've written our own versions of these as well, to make sure we don't write off the end of a buffer, and these have caught lots of problems too.

Run the original application with ADplus -crash -pn appnename.exe
When the memory issue pops-up you will get a nice big dump.
You can analyze the dump to figure what memory location was corrupted.
If you are lucky the overwrite memory is a unique string you can figure out where it came from. If you are not lucky, you will need to dig into win32 heap and figure what was the orignal memory characteristics. (heap -x might help)
After you know what was messed-up, you can narrow appverifier usage with special heap settings. i.e. you can specify what DLL you monitor, or what allocation size to monitor.
Hopefully this will speedup the monitoring enough to catch the culprit.
In my experience, I never needed full heap verifier mode, but I spent a lot of time analyzing the crash dump(s) and browsing sources.
P.S:
You can use DebugDiag to analyze the dumps.
It can point out the DLL owning the corrupted heap, and give you other usefull details.

You should attack this problem with both runtime and static analysis.
For static analysis consider compiling with PREfast (cl.exe /analyze). It detects mismatched delete and delete[], buffer overruns and a host of other problems. Be prepared, though, to wade through many kilobytes of L6 warning, especially if your project still has L4 not fixed.
PREfast is available with Visual Studio Team System and, apparently, as part of Windows SDK.

Is this in low memory conditions? If so it might be that new is returning NULL rather than throwing std::bad_alloc. Older VC++ compilers didn't properly implement this. There is an article about Legacy memory allocation failures crashing STL apps built with VC6.

The apparent randomness of the memory corruption sounds very much like a thread synchronization issue - a bug is reproduced depending on machine speed. If objects (chuncks of memory) are shared among threads and synchronization (critical section, mutex, semaphore, other) primitives are not on per-class (per-object, per-class) basis, then it is possible to come to a situation where class (chunk of memory) is deleted / freed while in use, or used after deleted / freed.
As a test for that, you could add synchronization primitives to each class and method. This will make your code slower because many objects will have to wait for each other, but if this eliminates the heap corruption, your heap-corruption problem will become a code optimization one.

You tried old builds, but is there a reason you can't keep going further back in the repository history and seeing exactly when the bug was introduced?
Otherwise, I would suggest adding simple logging of some kind to help track down the problem, though I am at a loss of what specifically you might want to log.
If you can find out what exactly CAN cause this problem, via google and documentation of the exceptions you are getting, maybe that will give further insight on what to look for in the code.

My first action would be as follows:
Build the binaries in "Release" version but creating debug info file (you will find this possibility in project settings).
Use Dr Watson as a defualt debugger (DrWtsn32 -I) on a machine on which you want to reproduce the problem.
Repdroduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
Another try might be using WinDebug as a debugging tool which is quite powerful being at the same time also lightweight.
Maybe these tools will allow you at least to narrow the problem to certain component.
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?

So from the limited information you have, this can be a combination of one or more things:
Bad heap usage, i.e., double frees, read after free, write after free, setting the HEAP_NO_SERIALIZE flag with allocs and frees from multiple threads on the same heap
Out of memory
Bad code (i.e., buffer overflows, buffer underflows, etc.)
"Timing" issues
If it's at all the first two but not the last, you should have caught it by now with either pageheap.exe.
Which most likely means it is due to how the code is accessing shared memory. Unfortunately, tracking that down is going to be rather painful. Unsynchronized access to shared memory often manifests as weird "timing" issues. Things like not using acquire/release semantics for synchronizing access to shared memory with a flag, not using locks appropriately, etc.
At the very least, it would help to be able to track allocations somehow, as was suggested earlier. At least then you can view what actually happened up until the heap corruption and attempt to diagnose from that.
Also, if you can easily redirect allocations to multiple heaps, you might want to try that to see if that either fixes the problem or results in more reproduceable buggy behavior.
When you were testing with VS2008, did you run with HeapVerifier with Conserve Memory set to Yes? That might reduce the performance impact of the heap allocator. (Plus, you have to run with it Debug->Start with Application Verifier, but you may already know that.)
You can also try debugging with Windbg and various uses of the !heap command.
MSN

Graeme's suggestion of custom malloc/free is a good idea. See if you can characterize some pattern about the corruption to give you a handle to leverage.
For example, if it is always in a block of the same size (say 64 bytes) then change your malloc/free pair to always allocate 64 byte chunks in their own page. When you free a 64 byte chunk then set the memory protection bits on that page to prevent reads and wites (using VirtualQuery). Then anyone attempting to access this memory will generate an exception rather than corrupting the heap.
This does assume that the number of outstanding 64 byte chunks is only moderate or you have a lot of memory to burn in the box!

If you choose to rewrite new/delete, I have done this and have simple source code at:
http://gandolf.homelinux.org/~smhanov/blog/?id=10
This catches memory leaks and also inserts guard data before and after the memory block to capture heap corruption. You can just integrate with it by putting #include "debug.h" at the top of every CPP file, and defining DEBUG and DEBUG_MEM.

The little time I had to solve a similar problem.
If the problem still exists I suggest you do this :
Monitor all calls to new/delete and malloc/calloc/realloc/free.
I make single DLL exporting a function for register all calls. This function receive parameter for identifying your code source, pointer to allocated area and type of call saving this information in a table.
All allocated/freed pair is eliminated. At the end or after you need you make a call to an other function for create report for left data.
With this you can identify wrong calls (new/free or malloc/delete) or missing.
If have any case of buffer overwritten in your code the information saved can be wrong but each test may detect/discover/include a solution of failure identified. Many runs to help identify the errors.
Good luck.

Do you think this is a race condition? Are multiple threads sharing one heap? Can you give each thread a private heap with HeapCreate, then they can run fast with HEAP_NO_SERIALIZE. Otherwise, a heap should be thread safe, if you're using the multi-threaded version of the system libraries.

A couple of suggestions. You mention the copious warnings at W4 - I would suggest taking the time to fix your code to compile cleanly at warning level 4 - this will go a long way to preventing subtle hard to find bugs.
Second - for the /analyze switch - it does indeed generate copious warnings. To use this switch in my own project, what I did was to create a new header file that used #pragma warning to turn off all the additional warnings generated by /analyze. Then further down in the file, I turn on only those warnings I care about. Then use the /FI compiler switch to force this header file to be included first in all your compilation units. This should allow you to use the /analyze switch while controling the output

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js