Parallel MPI code in C++ memory errors

I am trying to understand why the C++ code I am working with gives out-of-memory errors. It is a scientific code with several flag variables to turn a bunch of code functionalities on or off. The code works fine when a couple of functions are turned off; however, when these routines are active, it causes 'Out of memory' situations.
The error file created by qsub says:
Exit status : -4
job terminated due to one or more nodes running out of memory
The function I am talking about used to work fine until I made some additions. I basically created some pointers, initialized them to NULL, allocated a memory chunk to associate with each, stored a quantity of interest in it, and later called delete []*p.
I am trying hard to figure out the source of the problem. I believe it's some C++ programming error that I am overlooking due to my inexperience with C++. Is there a way to figure out what the bug is, where it is, and how to resolve it?
Some thoughts that ran through my mind:
- use try { } catch { } blocks
- Run a memory-monitoring program to track the system's memory usage in real time
- Any other efficient way of debugging MPI/C++ code for such situations.
I read a bit about stacks and heaps and how memory is laid out. What is the safest way to declare a 2D array or a 1D array on the fly: pointer-based or array-definition-based?
Please educate me with your thoughts.

valgrind should be able to give you an indication of where memory is being leaked.
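As for the question about the safest way to declare 1D/2D arrays on the fly: prefer std::vector over raw new[]/delete[], since it frees its memory automatically even when an exception unwinds the stack, and it removes the easy-to-get-wrong delete spelling entirely. A minimal sketch (the sizes are illustrative):

#include <cstddef>
#include <vector>

int main() {
    const std::size_t rows = 100, cols = 200;

    // 1D array "on the fly": no delete[] needed, freed automatically.
    std::vector<double> a(rows, 0.0);

    // 2D array as a vector of vectors; each row is a separate allocation.
    std::vector<std::vector<double> > grid(rows, std::vector<double>(cols, 0.0));

    grid[5][7] = a[3]; // indexed like a built-in 2D array

    // If you must use raw pointers, each new[] needs exactly one matching delete[]:
    double* p = new double[rows];
    delete[] p; // delete[] p, not delete p
    return 0;
}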

Related

Out of virtual memory address space (Borland C++ Builder 6 program)

I have a problem with an application written under C++ Builder 6. After some time running (a week, a month) the application crashes and closes without any error message. In my application log, shortly before the crash, I get many "Out of memory" exceptions.
I looked at the process when it was throwing out-of-memory exceptions (screenshot below) and it has lots of uncommitted private memory space. What could be the reason for such behavior?
I had this problem once, a couple of years ago. The reason back then was the option "use dynamic libraries" being unchecked in the linker options. When I checked it again the problem disappeared, and vice versa. The test application I made was just calling new char[1000000] and then delete. The memory was freed every time (no committed-memory rise in Windows Task Manager), but after some time I got out of memory; VMMap showed exactly the same thing: lots of reserved private memory, most of it uncommitted.
Now the problem has returned but I can't fix it the same way. I don't know if that was the reason, but back then I had Builder 6 and 2010 installed on the same machine. Now I just have Builder 6, and it seems I cannot reproduce the error with a test application like before. Either way, it seems there is some memory manager error or something. CodeGuard doesn't show any memory leaks. When I create a memory block with new it instantly shows in "memory commit size", and on delete the memory usage decreases, so I assume memory leaks are not the issue; Task Manager doesn't show much "memory commit size".
Is there anything I can do? Is there any way I can release uncommitted memory? How can I diagnose the problem any further?
The screenshot:
http://i.stack.imgur.com/UKuTZ.jpg
I found a way to simulate this problem, and a solution.
for (int i = 0; i < 100; i++)
{
    char* b = new char[100000000];  // large block, freed below
    new char;                       // deliberate 1-byte leak (the "100 B" mentioned below)
    delete[] b;                     // note: must be delete[] to match new[]
}
The Borland memory manager reserves blocks of memory whose size is a multiple of one page, which is 4 kB. When you allocate a size that is not a multiple of 4 kB, some free space is left over that Borland may use to allocate some other memory chunk. When the first chunk is deallocated, the second still keeps the whole memory block reserved.
At first glance the code should cause just a 100 B memory leak, but in fact it causes a memory allocation exception after fewer than 16 iterations.
I have found two solutions for this problem. One is FastMM; it works, but brings some troubles of its own.
The second solution is to exchange borlndmm.dll with the one from Embarcadero RAD Studio 2010. I haven't tested it thoroughly yet, but it seems to work without any problem.
I should move the whole project to RAD 2010, but for various reasons I am stuck with Borland 6.
Prologue
Hmm, interesting behaviour... I have to add some things I learned the hard way. I dismissed BCB6 almost immediately after trying it a few times because it had too many bugs for my taste (in comparison with BCB5, especially in AnsiString handling). So I stayed with BCB5 for a long time without any problems; I used it even for very big projects like CAD/CAM.
After a few years I had to move to BDS2006 because of my employer, and the problems started (some may be similar to yours). Aside from the minor IDE and trace/breakpoint/CodeGuard bugs, there are more important things:
memory manager
delete/delete[] corrupts the memory manager if called twice on the same pointer, without throwing any exception to notify you ...
a compiler bug producing a wrong default constructor/destructor for structs was the biggest problem I had (in combination with the delete issue)
wrong or missing member functions in classes can cause multiple delete calls, due to a bug either in the compiler or in the C++ engine
but I was lucky and solved it; see: bds 2006 C hidden memory manager conflicts (class new / delete[] vs. AnsiString)
wrong compile
Sometimes the app is compiled wrongly: no error is thrown, but some lines of code are missing from the exe and/or are in a different order than in the source code. I saw this occasionally in BCB 5 and 6 too. To solve it:
delete all temporary files (~, obj, tds, map, exe, ...)
close the IDE and open it again just to be sure (sometimes viewing local variables, mostly big arrays, corrupts IDE memory)
compile again
beware: breakpoint/trace/CodeGuard behave differently than the raw app
Especially with multi-threading, the app behaves differently while traced and while not. CodeGuard also makes a big difference (and I do not mean the slowdown of execution, which corrupts sensitive timing). For example, CodeGuard has a nasty habit of sometimes throwing out-of-memory exceptions without a reason, so some parts of code must be checked over and over until they go through, even when memory usage is unchanged and far from out of memory.
AnsiString operators
There are two kinds of AnsiString in the VCL: normal ones and component properties. It is wise to take that into account, because the operators behave differently for component-property AnsiStrings. Try for example something like
Edit1->Text+="xxx";
also there are still AnsiString operator bugs like this:
AnsiString version="aaa"+AnsiString("aaa")+"aaa"; // codeguard: array access violation
Import older BCB projects
Avoid direct import if possible; it often creates unknown allocation and memory-leak errors. I am not sure why, but I suspect that imported window classes are handled differently, and the memory leaks are related to bullet #1. The better way is to create a new app and create/copy the components and code manually. I know this is backward, but it is the only safe way to avoid the problems. I still don't know where the problem is, but a simple *.bdsproj replacement will not help, and I have seen nothing suspicious in the *.dfm either.

Dwarf Error: Cannot find DIE

I am having a lot of trouble debugging a segmentation fault in a C++ project in XCode 4.
I only get a segfault when I build with the "LLVM 2.0" compiler option and -O3 optimization. From what I understand, debugging options are limited when optimization is on, but here is the debug output I get when I run in Xcode with gdb turned on:
warning: Got an error handling event: "Dwarf Error: Cannot find DIE at 0x3be2 referenced from DIE at 0x11d [in module /Users/imran/Library/Developer/Xcode/DerivedData/cgo-hczcifktgscxjigfphieegbpxxsq/Build/Products/Debug/cgo]".
No memory available to program now: unsafe to call malloc
I can't get gdb to give me any useful information after that (like a backtrace), but I'm not sure I really know how to use it properly. When I try to use the LLDB debugger, Xcode just crashes (which has been a common theme since I started using it).
My program is deterministic, but when I try to isolate the problem with print statements the behavior will change. For example if I add cout << "hello"; at one point the segfault goes away. Other print statements cause my program to segfault in a different iteration of its main loop. And naturally when I put in enough print statements to supposedly pinpoint the offending code, the segfault seems to occur after one line but before the next (i.e. nowhere).
I am using pointers and dynamic memory allocation, which is likely the cause of the problem, but since I can't narrow down the block of code causing the error I don't know what code to show here.
I tried profiling with the "Leaks" tool in Instruments, but it didn't find any leaks.
Any advice? I am very inexperienced with debugging so anything would help, really.
EDIT: Solved. Given certain inputs, my program would try to read past the end of an array.
I don't think there's enough information here for me to help you with the DWARF issue; I am not familiar enough with that toolchain to know how robust it is.
Your crashing symptoms however smell a lot like heap corruption. I don't know what allocator OSX uses by default, but common optimizations store metadata inline with objects and/or thread the freelist through empty objects, which makes them very sensitive to buffer overflows on the heap. Freeing an object twice or using a dangling pointer (a pointer that has been freed but whose space may now be in use by another allocation) can also cause seemingly nondeterministic and hard to track errors, since the layout of the heap is likely to change between runs. Print statements also use the allocator, which means changing the print statements can change when and where the problem will appear.
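To make that concrete, here is a hypothetical example of the failure mode described above: an off-by-one write past the end of a heap block can clobber allocator metadata stored next to it, and the crash then appears much later, inside a seemingly unrelated allocation call.

#include <cstddef>

int main() {
    int* a = new int[4];
    for (std::size_t i = 0; i <= 4; ++i) // off-by-one: the last iteration touches a[4]
        a[i] = 0;                        // a[4] may overwrite inline heap metadata
    delete[] a;                          // corruption is often detected (or crashes) here,
                                         // or on some later, innocent-looking allocation
    return 0;
}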
A tool that you may find helpful in determining if this is a heap problem or something unrelated is a heap replacement called DieHard by my advisor (http://prisms.cs.umass.edu/emery/index.php?page=download-diehard). I believe it will build on OSX, and you can link it into your program using LD_PRELOAD=/path/to/libdiehard.so to replace the default allocator at runtime. Its sole purpose is to resist memory errors and heap corruption, so if your application actually runs with it, that's probably where you need to look.

Fixing Segmentation faults in C++

I am writing a cross-platform C++ program for Windows and Unix. On the Windows side, the code compiles and executes without problems. On the Unix side, it compiles, but when I try to run it I get a segmentation fault. My initial hunch is that there is a problem with pointers.
What are good methodologies to find and fix segmentation fault errors?
Compile your application with -g, then you'll have debug symbols in the binary file.
Use gdb to open the gdb console.
Use file and pass it your application's binary file in the console.
Use run and pass in any arguments your application needs to start.
Do something to cause a Segmentation Fault.
Type bt in the gdb console to get a stack trace of the Segmentation Fault.
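To make those steps concrete, here is a deliberately crashing toy program (purely illustrative), with the commands as comments:

// crash.cpp -- build: g++ -g crash.cpp -o crash
// then: gdb ./crash, type "run", and after the fault type "bt"
int main() {
    int* p = 0; // null pointer
    *p = 42;    // segmentation fault here; bt will point at this line
    return 0;
}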
Sometimes the crash itself isn't the real cause of the problem: perhaps the memory got smashed at an earlier point, but it took a while for the corruption to show itself. Check out valgrind, which has lots of checks for pointer problems (including array bounds checking). It'll tell you where the problem starts, not just the line where the crash occurs.
Before the problem arises, try to avoid it as much as possible:
Compile and run your code as often as you can. It will be easier to locate the faulty part.
Try to encapsulate low-level, error-prone routines so that you rarely have to work directly with memory (pay attention to the design of your program).
Maintain a test suite. Having an overview of what currently works, what no longer works, etc. will help you figure out where the problem is (Boost.Test is a possible solution; I don't use it myself, but the documentation can help you understand what kind of information should be displayed).
Use appropriate tools for debugging. On Unix:
GDB can tell you where you program crash and will let you see in what context.
Valgrind will help you to detect many memory-related errors.
With GCC you can also use mudflap. With GCC, Clang, and (since October, experimentally) MSVC you can use AddressSanitizer/MemorySanitizer. It can detect some errors that Valgrind doesn't, and the performance loss is lighter. It is used by compiling with the -fsanitize=address flag.
Finally, I would recommend the usual things: the more readable, maintainable, clear and neat your program is, the easier it will be to debug.
On Unix you can use valgrind to find issues. It's free and powerful. If you'd rather do it yourself, you can overload the new and delete operators to set up a configuration where each new object has a guard value such as 0xDEADBEEF before and after it. Then track what happens at each iteration. This can fail to catch everything (you aren't guaranteed to even touch those bytes), but it has worked for me in the past on a Windows platform.
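A minimal sketch of that guard idea, assuming 4-byte 0xDEADBEEF sentinels (the header layout and the reporting are illustrative assumptions, not a production allocator; the default operator new[]/delete[] forward to these, so arrays are covered too):

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <new>

static const std::size_t kHeader = 16;  // keeps the payload max-aligned
static const unsigned char kGuard[4] = { 0xDE, 0xAD, 0xBE, 0xEF };

void* operator new(std::size_t n) {
    // Layout: [size ... front guard][payload][rear guard]
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(kHeader + n + 4));
    if (!raw) throw std::bad_alloc();
    std::memcpy(raw, &n, sizeof n);             // remember the size
    std::memcpy(raw + kHeader - 4, kGuard, 4);  // sentinel just before the payload
    std::memcpy(raw + kHeader + n, kGuard, 4);  // sentinel just after the payload
    return raw + kHeader;
}

void operator delete(void* p) noexcept {
    if (!p) return;
    unsigned char* raw = static_cast<unsigned char*>(p) - kHeader;
    std::size_t n;
    std::memcpy(&n, raw, sizeof n);
    // If either sentinel changed, something wrote just outside its block.
    if (std::memcmp(raw + kHeader - 4, kGuard, 4) != 0 ||
        std::memcmp(raw + kHeader + n, kGuard, 4) != 0)
        std::fprintf(stderr, "guard bytes smashed for block %p\n", p);
    std::free(raw);
}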
Yes, there is a problem with pointers. Very likely you're using one that's not initialized properly, but it's also possible that you're messing up your memory management with double frees or some such.
To avoid uninitialized pointers as local variables, try declaring them as late as possible, preferably (and this isn't always possible) when they can be initialized with a meaningful value. Convince yourself that they will have a value before they're being used, by examining the code. If you have difficulty with that, initialize them to a null pointer constant (usually written as NULL or 0) and check them.
To avoid uninitialized pointers as member values, make sure they're initialized properly in the constructor, and handled properly in copy constructors and assignment operators. Don't rely on an init function for memory management, although you can for other initialization.
If your class doesn't need copy constructors or assignment operators, you can declare them as private member functions and never define them. That will cause a compiler error if they're explicitly or implicitly used.
Use smart pointers when applicable. The big advantage here is that, if you stick to them and use them consistently, you can completely avoid writing delete and nothing will be double-deleted.
Use C++ strings and container classes whenever possible, instead of C-style strings and arrays. Consider using .at(i) rather than [i], because that will force bounds checking. See if your compiler or library can be set to check bounds on [i], at least in debug mode. Segmentation faults can be caused by buffer overruns that write garbage over perfectly good pointers.
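For instance, with at() an out-of-bounds index becomes a catchable exception instead of silent memory corruption:

#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> v(10);
    try {
        v.at(42) = 1; // throws std::out_of_range; v[42] would be undefined behaviour
    } catch (const std::out_of_range& e) {
        std::printf("caught: %s\n", e.what());
    }
    return 0;
}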
Doing those things will considerably reduce the likelihood of segmentation faults and other memory problems. They will doubtless fail to fix everything, and that's why you should use valgrind now and then when you don't have problems, and valgrind and gdb when you do.
I don't know of any methodology for fixing things like this, and I don't think it would be possible to come up with one, for the very issue at hand is that your program's behavior is undefined (I don't know of any case where a SEGFAULT wasn't caused by some sort of UB).
There are all kinds of "methodologies" to avoid the issue before it arises. One important one is RAII.
Besides that, you just have to throw your best psychic energies at it.

C++: Where to start when my application crashes at random places?

I'm developing a game and when I do a specific action in the game, it crashes.
So I went debugging and saw that my application crashed at simple C++ statements like if, return, ... Each time I re-run, it crashes randomly at one of these 3 lines, and it never succeeds.
line 1:
if (dynamic) { ... } // dynamic is a bool member of my class
line 2:
return m_Fixture; // a line of the Box2D physical engine. m_Fixture is a pointer.
line 3:
return m_Density; // The body of a simple getter for an integer.
I get no errors from the app nor the OS...
Are there hints, tips or tricks to debug more efficiently and find out what is going on?
That's why I love Java...
Thanks
Random crashes like this are usually caused by stack corruption, since these are branching instructions and thus are sensitive to the condition of the stack. These are somewhat hard to track down, but you should run valgrind and examine the call stack on each crash to try and identify common functions that might be the root cause of the error.
Are there hints, tips or tricks to debug more efficiently and find out what is going on?
Run the game in a debugger; at the point of crash, check the values of all arguments, either using the Visual Studio watch window or using gdb. Using the call stack, check the parent routines and try to think about what could go wrong.
In suspicious routines (potentially related to the crash), consider dumping all arguments to stderr (if you're using libsdl or are on *nix-like systems), or write a logfile, or send duplicates of all error messages using (on Windows) OutputDebugString. This will make them visible in the "Output" window in Visual Studio or the debugger. You can also write "traces" (log("function %s was called", __FUNCTION__)).
If you can't debug immediately, produce core dumps on crash. On Windows this can be done using MiniDumpWriteDump; on Linux it is set somewhere in the configuration variables. Core dumps can be loaded into a debugger. I'm not sure whether VS Express can deal with them on Windows, but you can still debug them using WinDbg.
If the crash happens within a class method, check the this pointer. It could be invalid or null.
If the bug is truly evil (elusive stack corruption in a multithreaded app that leads to a delayed crash), write a custom memory manager that overrides new/delete, provides an alternative to malloc (if your app uses it for some reason, which is possible), AND locks all unused memory using VirtualProtect (Windows) or an OS-specific alternative. In this case every potentially dangerous operation crashes the app instantly, which lets you debug the problem (if you have a just-in-time debugger) and find the dangerous routine immediately. I prefer such a "custom memory manager" to BoundsChecker and the like, since in my experience it was more useful. As an alternative you could try Valgrind, which is available on Linux only. Note that if your app allocates memory very frequently, you'll need a large amount of RAM to be able to lock every unused memory block (because to be locked, a block must be PAGE_SIZE bytes big).
In areas where you need a sanity check, either use ASSERT or (IMO the better solution) write a routine that crashes the application (by throwing a std::exception with a meaningful message) if some condition isn't met.
If you've identified a problematic routine, walk through it using debugger's step into/step over. Watch the arguments.
If you've identified a problematic routine but can't directly debug it for whatever reason, dump all variables to stderr or a logfile (fprintf or iostreams, your choice) after every statement within that routine. Then analyze the outputs and think about how it could have happened. Make sure to flush the logfile after every write, or you might miss the data right before the crash.
In general you should be happy that the app crashes somewhere: a crash means a bug you can quickly find with a debugger and exterminate. Bugs that don't crash the program are much more difficult (an example of a truly complex bug: given 100000 input values, after a few hundred manipulations of those values, among thousands of outputs the app produces 1 absolutely incorrect result, which shouldn't have happened at all).
That's why I love Java...
Excuse me, but if you can't deal with the language, it is entirely your fault. If you can't handle the tool, either pick another one or improve your skill. It is possible to make a game in Java, by the way.
These are mostly due to stack corruption, but heap corruption can also affect programs in this way.
Stack corruption occurs most of the time because of "off-by-one errors".
Heap corruption occurs because of new/delete not being handled carefully, like a double delete.
Basically what happens is that the overflow/corruption overwrites an important instruction or address; then, much later, when you try to execute or use it, the program crashes.
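For example, here is a hypothetical double delete of the kind mentioned above; the second delete hands the allocator a block it has already recycled, corrupting its bookkeeping, and the crash often surfaces much later:

int main() {
    int* p = new int(42);
    delete p;      // first delete: fine
    delete p;      // second delete: undefined behaviour, corrupts the heap;
                   // the crash often appears much later, in unrelated code
    // The usual defence is to null the pointer right after the first delete:
    // delete p; p = 0;   (deleting a null pointer is a safe no-op)
    return 0;
}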
I generally like to take a second to step back and think through the code, trying to catch any logic errors.
You might try commenting out different parts of the code and seeing if it affects how the program behaves.
Besides those two things, you could try using a debugger like the one in Visual Studio or Eclipse, etc.
Lastly, you could post your code and the error you are getting on a website with a community that knows programming and could help you work through the error (read: Stack Overflow).
Crashes / segfaults usually happen when you access a memory location that you are not allowed to access, or access a memory location in a way that is not allowed (for example, attempting to write to a read-only location).
There are many memory analyzer tools, for example I use Valgrind which is really great in telling what the issue is (not only the line number, but also what's causing the crash).
There are no simple C++ statements. An if is only as simple as the condition you evaluate. A return is only as simple as the expression you return.
You should use a debugger and/or post some of the crashing code. Can't be of much use with "my app crashed" as information.
I had problems like this before. I was trying to refresh the GUI from different threads.
If the if statements involve dereferencing pointers, you're almost certainly corrupting the stack (this explains why an innocent return 0 would crash...)
This can happen, for instance, by going out of bounds in an array (you should be using std::vector!), trying to strcpy a char[]-based string missing the ending '\0' (you should be using std::string!), passing a bad size to memcpy (you should be using copy-constructors!), etc.
Try to figure out a way to reproduce it reliably, then place a watch on the corrupted pointer. Run through the code line-by-line until you find the very line that corrupts the pointer.
Look at the disassembly. Almost any C/C++ debugger will be happy to show you the machine code and the registers where the program crashed. The registers include the Instruction Pointer (EIP or RIP on x86/x64) which is where the program was when it stopped. The other registers usually have memory addresses or data. If the memory address is 0 or a bad pointer, there is your problem.
Then you just have to work backward to find out how it got that way. Hardware breakpoints on memory changes are very helpful here.
On Linux/BSD/Mac, GDB's scripting features can help a lot here. You can script things so that after a breakpoint is hit 20 times it enables a hardware watch on the address of array element 17, etc.
You can also build debugging into your program. Use the assert() macro. Everywhere!
Use assert to check the arguments to every function. Use assert to check the state of every object before you exit the function. In a game, assert that the player is on the map, that the player has health between 0 and 100; assert everything you can think of. For complicated objects, write verify() or validate() functions into the object itself that check everything about it, and call those from an assert(). A minimal sketch follows.
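Here is that sketch (the Player type and its invariants are made up for illustration):

#include <cassert>

struct Player {
    int x, y, health;
    // validate() centralises the invariants so the asserts stay one-liners.
    bool validate() const {
        return health >= 0 && health <= 100 && x >= 0 && y >= 0;
    }
};

void damage(Player& p, int amount) {
    assert(amount >= 0);   // check arguments on entry
    p.health -= amount;
    if (p.health < 0) p.health = 0;
    assert(p.validate());  // check object state before returning
}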
Another way to build in debugging is to use signal() on Linux or asm int 3 on Windows to break into the debugger from the program. Then you can write temporary code into the program to check whether it is on iteration 1117321 of the main loop. That is useful if the bug always happens at 1117322; the program will run much faster this way than with a debugger breakpoint.
Some tips:
- run your application under a debugger, with the symbol files (PDB) alongside
- How to set Visual Studio as the default post-mortem debugger?
- set the default debugger for WinDbg Just-in-time Debugging
- check memory allocations: Overriding new and delete, and Overriding malloc and free
One other trick: turn off code optimization and see if the crash points make more sense. Optimization is allowed to float little bits of your code to surprising places; mapping that back to source code lines can be less than perfect.
Check pointers. At a guess, you're dereferencing a null pointer.
I've found 'random' crashes when there is a reference to a deleted object. Since the memory is not necessarily overwritten immediately, in many cases you don't notice it and the program works correctly, then crashes after the memory has been updated and is no longer valid.
JUST FOR DEBUGGING PURPOSES, try commenting out some suspicious 'deletes'. Then, if it doesn't crash anymore, there you are.
use the GNU Debugger
Refactoring.
Scan all the code, make it clearer if it's not clear at first read, try to understand what you wrote, and immediately fix whatever seems incorrect.
You'll certainly discover the problem(s) this way and fix a lot of other problems too.

Heap corruption under Win32; how to locate?

I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools to locate this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced - there are a lot of code changes in the repository.
The prompt for the crashing behaviour is generating throughput in this system: socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to throw exceptions (various places, various causes, including heap alloc failing, thus: heap corruption).
The behaviour seems related to CPU power or memory bandwidth; the more of each the machine has, the easier it is to crash. Disabling a hyper-threading core or a dual-core core reduces the rate of (but does not eliminate) corruption. This suggests a timing related issue.
Now here's the rub:
When it's run under a lightweight debug environment (say Visual Studio 98 / AKA MSVC6) the heap corruption is reasonably easy to reproduce: ten or fifteen minutes pass before something fails horrendously and throws an exception, like an alloc failure. When running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9 or even Microsoft Application Verifier) the system becomes memory-speed bound and doesn't crash (memory-bound: CPU is not getting above 50%, the disk light is not on, the program's going as fast as it can, the box is consuming 1.3 GB of 2 GB of RAM). So I've got a choice between being able to reproduce the problem (but not identify the cause) or being able to identify the cause of a problem I can't reproduce.
My current best guesses as to where to next is:
Get an insanely grunty box (to replace the current dev box: 2 GB RAM in an E6550 Core 2 Duo); this will make it possible to reproduce the crash-causing misbehaviour when running under a powerful debug environment; or
Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad-guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.
And, no: Shipping with Purify instrumentation built in is not an option.
A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"
And now, the question: How do I locate the heap corruptor?
Update: balancing new[] and delete[] seems to have gotten a long way towards solving the problem. Instead of 15mins, the app now goes about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.
Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.
Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
I'll take note of that, but I'm concerned that Dr Watson will only be tripped up after the fact, not while the heap is getting stomped on.
Another option might be using WinDbg as a debugging tool; it is quite powerful while at the same time also lightweight.
Got that going at the moment, again: not much help until something goes wrong. I want to catch the vandal in the act.
Maybe these tools will allow you at least to narrow the problem to certain component.
I don't hold much hope, but desperate times call for...
And are you sure that all the components of the project have the correct runtime library settings (C/C++ tab, Code Generation category in the VS 6.0 project settings)?
No I'm not, and I'll spend a couple of hours tomorrow going through the workspace (58 projects in it) and checking they're all compiling and linking with the appropriate flags.
Update: This took 30 seconds. Select all projects in the Settings dialog, unselect until you find the project(s) that don't have the right settings (they all had the right settings).
My first choice would be a dedicated heap tool such as pageheap.exe.
Rewriting new and delete might be useful, but that doesn't catch the allocations made by lower-level code. If this is what you want, better to detour the low-level alloc APIs using Microsoft Detours.
Also run sanity checks such as: verify your run-time libraries match (release vs. debug, multi-threaded vs. single-threaded, dll vs. static lib), look for bad deletes (e.g., delete where delete [] should have been used), and make sure you're not mixing and matching your allocators.
Also try selectively turning off threads and see when/if the problem goes away.
What does the call stack etc look like at the time of the first exception?
I have the same problems in my work (we also use VC6 sometimes), and there is no easy solution for it. I have only some hints:
Try automatic crash dumps on the production machine (see Process Dumper). My experience is that Dr. Watson is not perfect for dumping.
Remove all catch(...) from your code. They often hide serious memory exceptions.
Check Advanced Windows Debugging; there are lots of great tips for problems like yours. I recommend it with all my heart.
If you use the STL, try STLPort and checked builds. Invalid iterators are hell.
Good luck. Problems like yours take us months to solve. Be ready for this...
We've had pretty good luck writing our own malloc and free functions. In production, they just call the standard malloc and free, but in debug they can do whatever you want. We also have a simple base class that does nothing but override the new and delete operators to use these functions; then any class you write can simply inherit from that class. If you have a ton of code, it may be a big job to replace calls to malloc and free with the new malloc and free (don't forget realloc!), but in the long run it's very helpful.
In Steve Maguire's book Writing Solid Code (highly recommended), there are examples of debug stuff that you can do in these routines, like:
Keep track of allocations to find leaks
Allocate more memory than necessary and put markers at the beginning and end of memory -- during the free routine, you can ensure these markers are still there
memset the memory with a marker on allocation (to find usage of uninitialized memory) and on free (to find usage of freed memory); a minimal sketch of this follows below
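Here is that sketch (the names dbg_malloc/dbg_free and the fill values 0xCD/0xDD are assumptions for illustration, not the code from the book):

#include <cstdlib>
#include <cstring>

void* dbg_malloc(std::size_t n) {
    void* p = std::malloc(n);
    if (p) std::memset(p, 0xCD, n); // "clean" marker: makes reads of uninitialized memory obvious
    return p;
}

void dbg_free(void* p, std::size_t n) { // caller tracks n in this simplified sketch
    if (!p) return;
    std::memset(p, 0xDD, n);            // "dead" marker: makes use-after-free obvious
    std::free(p);
}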
Another good idea is to never use things like strcpy, strcat, or sprintf -- always use strncpy, strncat, and snprintf. We've written our own versions of these as well, to make sure we don't write off the end of a buffer, and these have caught lots of problems too.
Run the original application with ADPlus -crash -pn appname.exe
When the memory issue pops up you will get a nice big dump.
You can analyze the dump to figure out what memory location was corrupted.
If you are lucky, the overwritten memory is a unique string and you can figure out where it came from. If you are not lucky, you will need to dig into the Win32 heap and figure out what the original memory characteristics were (!heap -x might help).
After you know what was messed up, you can narrow AppVerifier usage with special heap settings, i.e. you can specify which DLL you monitor, or which allocation size to monitor.
Hopefully this will speed up the monitoring enough to catch the culprit.
In my experience, I never needed full heap verifier mode, but I spent a lot of time analyzing the crash dump(s) and browsing sources.
P.S:
You can use DebugDiag to analyze the dumps.
It can point out the DLL owning the corrupted heap, and give you other useful details.
You should attack this problem with both runtime and static analysis.
For static analysis consider compiling with PREfast (cl.exe /analyze). It detects mismatched delete and delete[], buffer overruns and a host of other problems. Be prepared, though, to wade through many kilobytes of L6 warnings, especially if your project still has L4 warnings unfixed.
PREfast is available with Visual Studio Team System and, apparently, as part of Windows SDK.
Is this happening in low-memory conditions? If so, it might be that new is returning NULL rather than throwing std::bad_alloc; older VC++ compilers didn't properly implement this. There is an article about legacy memory allocation failures crashing STL apps built with VC6.
The apparent randomness of the memory corruption sounds very much like a thread synchronization issue: the bug reproduces depending on machine speed. If objects (chunks of memory) are shared among threads and the synchronization primitives (critical section, mutex, semaphore, other) are not on a per-object basis, then it is possible to end up in a situation where a chunk of memory is deleted/freed while in use, or used after being deleted/freed.
As a test for that, you could add synchronization primitives to each class and method. This will make your code slower because many objects will have to wait for each other, but if this eliminates the heap corruption, your heap-corruption problem will become a code optimization one.
You tried old builds, but is there a reason you can't keep going further back in the repository history and seeing exactly when the bug was introduced?
Otherwise, I would suggest adding simple logging of some kind to help track down the problem, though I am at a loss of what specifically you might want to log.
If you can find out what exactly CAN cause this problem, via Google and the documentation of the exceptions you are getting, maybe that will give further insight on what to look for in the code.
My first action would be as follows:
Build the binaries in "Release" mode but create the debug info file (you will find this option in the project settings).
Use Dr Watson as the default debugger (DrWtsn32 -I) on a machine on which you want to reproduce the problem.
Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
Another option might be using WinDbg as a debugging tool; it is quite powerful while at the same time also lightweight.
Maybe these tools will allow you at least to narrow the problem to certain component.
And are you sure that all the components of the project have the correct runtime library settings (C/C++ tab, Code Generation category in the VS 6.0 project settings)?
So from the limited information you have, this can be a combination of one or more things:
Bad heap usage, i.e., double frees, read after free, write after free, setting the HEAP_NO_SERIALIZE flag with allocs and frees from multiple threads on the same heap
Out of memory
Bad code (i.e., buffer overflows, buffer underflows, etc.)
"Timing" issues
If it's either of the first two but not the last, you should have caught it by now with pageheap.exe.
Which most likely means it is due to how the code is accessing shared memory. Unfortunately, tracking that down is going to be rather painful. Unsynchronized access to shared memory often manifests as weird "timing" issues. Things like not using acquire/release semantics for synchronizing access to shared memory with a flag, not using locks appropriately, etc.
At the very least, it would help to be able to track allocations somehow, as was suggested earlier. At least then you can view what actually happened up until the heap corruption and attempt to diagnose from that.
Also, if you can easily redirect allocations to multiple heaps, you might want to try that to see if it either fixes the problem or results in more reproducible buggy behavior.
When you were testing with VS2008, did you run with HeapVerifier with Conserve Memory set to Yes? That might reduce the performance impact of the heap allocator. (Plus, you have to run with it Debug->Start with Application Verifier, but you may already know that.)
You can also try debugging with Windbg and various uses of the !heap command.
MSN
Graeme's suggestion of custom malloc/free is a good idea. See if you can characterize some pattern about the corruption to give you a handle to leverage.
For example, if it is always in a block of the same size (say 64 bytes), then change your malloc/free pair to always allocate 64-byte chunks in their own page. When you free a 64-byte chunk, set the memory protection bits on that page to prevent reads and writes (using VirtualProtect). Then anyone attempting to access this memory will generate an exception rather than corrupting the heap.
This does assume that the number of outstanding 64-byte chunks is only moderate, or that you have a lot of memory to burn in the box!
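A sketch of that page-per-chunk idea (Windows-specific; the 64-byte scenario and the function names are assumptions taken from the discussion above):

#include <windows.h>

// Give each 64-byte chunk its own page so protection can be applied per allocation.
void* alloc_chunk() {
    return VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}

// On free, keep the page reserved but make it inaccessible: any later read or
// write faults immediately at the offending instruction instead of corrupting the heap.
void free_chunk(void* p) {
    DWORD oldProtect;
    VirtualProtect(p, 4096, PAGE_NOACCESS, &oldProtect);
}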
If you choose to rewrite new/delete, I have done this and have simple source code at:
http://gandolf.homelinux.org/~smhanov/blog/?id=10
This catches memory leaks and also inserts guard data before and after the memory block to capture heap corruption. You can just integrate with it by putting #include "debug.h" at the top of every CPP file, and defining DEBUG and DEBUG_MEM.
Some time ago I had to solve a similar problem.
If the problem still exists, I suggest you do this:
Monitor all calls to new/delete and malloc/calloc/realloc/free.
I made a single DLL exporting a function that registers all calls. This function receives parameters identifying your source location, a pointer to the allocated area, and the type of call, and saves this information in a table.
Each matched allocate/free pair is eliminated from the table. At the end (or whenever you need), you call another function to create a report of the remaining entries.
With this you can identify mismatched call pairs (new/free or malloc/delete) or missing deallocations.
If a buffer is overwritten somewhere in your code, the saved information can itself be corrupted, but each test run may detect a failure; many runs help identify the errors.
Good luck.
Do you think this is a race condition? Are multiple threads sharing one heap? Can you give each thread a private heap with HeapCreate? Then they can run fast with HEAP_NO_SERIALIZE. Otherwise, a heap should be thread-safe if you're using the multi-threaded version of the system libraries.
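A minimal sketch of that private-heap idea (Windows; the sizes are illustrative):

#include <windows.h>

void worker() {
    // One heap per thread; HEAP_NO_SERIALIZE skips the internal lock, which is
    // safe only because no other thread ever touches this heap.
    HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0); // default initial size, growable
    void* p = HeapAlloc(heap, 0, 256);
    // ... use p from this thread only ...
    HeapFree(heap, 0, p);
    HeapDestroy(heap); // also releases anything still allocated from this heap
}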
A couple of suggestions. You mention the copious warnings at W4; I would suggest taking the time to fix your code to compile cleanly at warning level 4; this will go a long way toward preventing subtle, hard-to-find bugs.
Second, for the /analyze switch: it does indeed generate copious warnings. To use this switch in my own project, I created a new header file that used #pragma warning to turn off all the additional warnings generated by /analyze; then, further down in the file, I turn on only those warnings I care about. Then use the /FI compiler switch to force this header file to be included first in all your compilation units. This should allow you to use the /analyze switch while controlling the output. A sketch of such a header follows.
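A hypothetical version of such a header (the warning numbers are examples only, and the file name is made up; force-include it with something like cl.exe /analyze /FI analyze_warnings.h):

// analyze_warnings.h -- force-included first in every translation unit via /FI
#pragma warning(disable: 6001 6011 6031 6054 6387)  // silence /analyze noise wholesale
// ...further down, re-enable only the warnings we care about:
#pragma warning(default: 6011)  // dereferencing a potentially NULL pointer
#pragma warning(default: 6387)  // passing a possibly invalid parameter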