Software crash after steep rise of process' working set memory - c++

I ask this question because we're really stuck finding the cause of a software crash. I know that questions like "why does the software crash?" are not appreciated, but we really don't know how to find the problem.
We are currently running a long-term test of our software. To find potential memory leaks, we used the Windows Performance Monitor to track several memory metrics, such as private bytes, working set, and virtual bytes.
The software ran for quite a long time (about 30 hours) without any problems. It does the same thing the whole time: reading an image from the hard drive, doing some inspection, and showing some results.
Then it suddenly crashes. Inspecting the memory metrics in Performance Monitor, we saw a strange, steep rise in the working-set graph at 10:17 AM. We have encountered this several times, and according to the dump files the exception code is always 0xC0000005 ("the thread tried to read from or write to a virtual address for which it does not have the appropriate access"), but it appears at different positions in the code, where no pointers are used.
Does anyone know what could cause such a steep rise of the working set, and why this could crash the software? How can we find out whether our software has a bug, when the crash occurs at a different position every time?
The application is written in C++ and runs on a Windows 7 32-bit PC.

It's actually impossible to know from the information you have provided, but I would suggest that you have some memory corruption (hence the access violation). It could be a buffer-overflow issue; for example, a string is missing its null terminator, so something keeps being appended indefinitely.
The recommended next step is to download the Debugging Tools for Windows suite. Set up WinDbg with your correct symbol files and analyse the stack trace to find the general area of the crash. Depending on the cause of the memory corruption this will be more or less useful: the memory may have been corrupted long before the crash occurs.
Ideally, also run a static analysis tool on the code.

With the information you have now, there is little chance of getting an answer. You need more information, more specifically:
Gather more intelligence (is there anything specific about the files that cause the crash? What about the last-but-one file?)
Insert more tracing and logging (as much as you can without making it 2x slower). At the very least you'll see where it crashes, and then you will be able to insert more tracing/logging around that place.
As you're on Windows, consider handling 0xC0000005 via _set_se_translator, converting it into a C++ exception, with even more logging along the path this exception unwinds (see the sketch at the end of this answer).
There is no silver bullet for this kind of problem; the only way is gathering more information and figuring it out.
P.S. As an unlikely shot: I've seen similar things caused by a bug in the MS heap. If you're not using the low-fragmentation heap (LFH) yet (not sure, it might be the default now), there is a 1% chance that changing your default heap to LFH will help.
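A minimal sketch of the _set_se_translator suggestion (assuming MSVC compiled with the /EHa switch, which the translator requires; the logging is a placeholder):

#include <windows.h>
#include <eh.h>
#include <stdexcept>

// Invoked for every SEH exception raised on this thread; requires /EHa.
void SehTranslator(unsigned int code, EXCEPTION_POINTERS* /*info*/)
{
    if (code == EXCEPTION_ACCESS_VIOLATION)
        throw std::runtime_error("access violation (0xC0000005)");
    throw std::runtime_error("SEH exception");
}

int main()
{
    _set_se_translator(SehTranslator);
    try {
        // ... the long-running image-processing loop ...
    }
    catch (const std::exception& e) {
        // Log e.what() plus whatever state you track, then exit or rethrow.
    }
    return 0;
}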

Related

What's the best way of finding a heap corruption that only occurs under a performance test?

The software I work on (written in C++) has a heap corruption problem at the moment. Our perf test team keeps getting WER faults when the number of users logged on to the box reaches a certain threshold, but the dumps they've given me just show corruption in innocent areas (like when std::string frees its underlying memory, for example).
I've tried using AppVerifier, and it did throw up a number of issues which I've now fixed. However, I'm now in the situation where the testers can load up the machine as much as possible with AppVerifier attached and have a clean run, but still get heap corruption when running without AppVerifier (I guess because they can get more users on without it). This has meant I've been unable to get a dump which actually shows the problem.
Does anyone have any other ideas for useful techniques or technologies I can use? I've done as much analysis as I can on the heap-corruption dumps without AppVerifier, but I can't see any common themes. No threads are doing anything interesting at the same time as the crash, and the thread which crashes is innocent, which makes me think the corruption occurred some time before.
The best tool is AppVerifier in combination with gflags, but there are many other approaches that may help.
For example, you could request a heap check on every 16th malloc, realloc, free, and _msize operation with the following code:
#include <crtdbg.h>

int main()
{
    // Get the current flag bits (this only has an effect in debug builds).
    int tmp = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);
    // Clear the upper 16 bits and OR in the desired frequency.
    tmp = (tmp & 0x0000FFFF) | _CRTDBG_CHECK_EVERY_16_DF;
    // Set the new bits.
    _CrtSetDbgFlag(tmp);
    return 0;
}
You have my sympathies: a very difficult problem to track down.
As you say, these normally occur some time prior to the crash, generally as the result of a misbehaving write (e.g. writing to deleted memory, running off the end of an array, exceeding the memory allocated in a memcpy, etc.).
In the past (on Linux; I gather you're on Windows) I've used heap-checking tools (Valgrind, Purify, Intel Inspector), but as you've observed these often affect performance and thus obscure the bug. (You don't say whether it's a multi-threaded app, or whether it processes a variable dataset such as incoming messages.)
I have also overloaded the new and delete operators to detect double deletes, but this is quite a specific situation.
If none of the available tools help, then you're on your own and it's going to be a long debugging process.
The best advice I can offer is to work on reducing the test scenario that reproduces it. Then attempt to reduce the amount of code being exercised, i.e. by stubbing out parts of the functionality. Eventually you'll zero in on the problem, but I've seen very good people spend six weeks or more tracking these down in a large application (~1.5 million LOC).
All the best.
You should elaborate further on what your software actually does. Is it multi-threaded? When you talk about the "number of users logged on to the box", does each user open a different instance of your software in a different session? Is your software a web service? Do instances talk to each other (e.g. with named pipes)?
If your error only occurs at high load and does not occur when AppVerifier is running, the only two possibilities (without more information) I can think of are a concurrency issue in how you've implemented multi-threading, or a hardware issue on the test machine that only manifests under heavy load (have your testers used more than one machine?).

What to do when an out-of-memory error occurs? [duplicate]

Possible Duplicate:
What's the graceful way of handling out of memory situations in C/C++?
Hi, this seems to be a simple question at first glance, and I don't want to start a huge discussion on what-is-the-best-way-to-do-this.
Context: Windows >= 5, 32 bit, C++, Windows SDK / Win32 API
But after asking a similar question and reading some MSDN material about Win32 memory management, I'm now even more confused about what to do if an allocation fails, say, in the C++ new operator.
So I'm very interested now in how you implement (and implicitly, if you do implement) an error handling for OOM in your applications.
If, where (the main function?), for which operations (allocations), and how you handle an OOM error.
(I don't really mean that subjectively, turning this into a question of preference; I'd just like to see different approaches that account for different conditions or fit different situations. So feel free to offer answers for GUI apps, services, user-mode stuff...)
Some exemplary reactions to OOM to show what I mean:
GUI app: Message box, exit process
non-GUI app: Log error, exit process
service: try to recover, e.g. kill the thread that raised an exception, but continue execution
critical app: try again until an allocation succeeds, reducing the requested amount of memory (see the sketch below)
hands off OOM: let the STL / Boost / OS handle it
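For the "critical app" case, a minimal sketch of the retry-with-a-smaller-request idea (the function name and halving policy are illustrative, not from any library):

#include <cstddef>
#include <new>

// Halve the request until an allocation succeeds; on return, size holds
// the number of bytes actually obtained (0 if even min_size failed).
char* alloc_shrinking(std::size_t& size, std::size_t min_size)
{
    while (size >= min_size) {
        char* p = new (std::nothrow) char[size];
        if (p) return p;
        size /= 2;  // back off and retry with a smaller buffer
    }
    size = 0;
    return 0;
}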
Thank you for your answers!
The best-explained way will receive the great honour of being an accepted answer :D - even if it only consists of a MessageBox line, but explains why everything else was useless, wrong, or unnecessary.
Edit: I appreciate your answers so far, but I'm missing a bit of an actual answer. What I mean is, most of you say don't mind OOM since you can't do anything when there's no memory left (the system hangs / performs poorly). But does that mean to avoid any error handling for OOM? Or only to do a simple try-catch in main showing a MessageBox?
On most modern OSes, OOM will occur long after the system has become completely unusable, since before actually running out, the virtual memory system will start paging physical RAM out to make room for allocating additional virtual memory and in all likelihood the hard disk will begin to thrash like crazy as pages have to be swapped in and out at higher and higher frequencies.
In short, you have much more serious concerns to deal with before you go anywhere near OOM conditions.
Side note: At the moment, the above statement isn't as true as it used to be, since 32-bit machines with loads of physical RAM can exhaust their address space before they start to page. But this is still not common and is only temporary, as 64-bit ramps up and approaches mainstream adoption.
Edit: It seems that 64-bit is already mainstream. While perusing the Dell web site, I couldn't find a single 32-bit system on offer.
You do the exact same thing you do when:
you created 10,000 windows
you allocated 10,000 handles
you created 2,000 threads
you exceeded your quota of kernel pool memory
you filled up the hard disk to capacity.
You send your customer a very humble message in which you apologize for writing such crappy code and promise a delivery date for the bug fix. Anything else is not nearly good enough. How you want to be notified about it is up to you.
Basically, you should do whatever you can to avoid having the user lose important data. If disk space is available, you might write out recovery files. If you want to be super helpful, you might allocate recovery files while your program is open, to ensure that they will be available in case of emergency.
Simply display a message or dialog box (depending on whether you're in a terminal or a windowing system) saying "Error: Out of memory", possibly with debugging info, and include an option for your user to file a bug report, or a web link to where they can do that.
If you're really out of memory then, in all honesty, there's no point doing anything other than exiting gracefully; trying to handle the error is useless, as there is nothing you can do.
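A minimal sketch of that approach on Win32 (assuming a GUI process; std::set_new_handler installs a callback the runtime invokes when new cannot satisfy a request):

#include <windows.h>
#include <new>
#include <cstdlib>

// Called by the runtime when operator new fails. Keep it minimal:
// further allocations in here will most likely fail too.
void OnOutOfMemory()
{
    MessageBoxA(NULL, "Error: Out of memory", "MyApp", MB_OK | MB_ICONERROR);
    std::exit(EXIT_FAILURE);  // or std::abort() to produce a crash dump
}

int main()
{
    std::set_new_handler(OnOutOfMemory);
    // ... rest of the application ...
    return 0;
}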
In my case, what happens when you have an app that fragments memory so much that it cannot allocate the contiguous block needed to process a huge number of nodes?
Well, I split the processing up as much as I could.
For OOM, you can do the same thing, chop your processes up into as many pieces as possible and do them sequentially.
Of course, for handling the error until you get to fix it (if you can!), you typically let it crash. Then you determine that those memory allocations are failing (like you never expected) and put an error message directly in front of the user, along the lines of "oh dear, it's all gone wrong; log a call with the support dept". In all cases, you inform the user however you like. Though it's established practice to use whatever mechanism the app currently uses: if it writes to a log file, do that; if it displays an error dialog, do the same; if it uses the Windows 'send info to Microsoft' dialog, go right ahead and let that be the bearer of bad tidings. Users are expecting it, so don't try to be clever and do something else.
It depends on your app, your skill level, and your time. If it needs to be running 24/7 then obviously you must handle it. It depends on the situation. Perhaps it may be possible to try a slower algorithm but one that requires less heap. Maybe you can add functionality so that if OOM does occur your app is capable of cleaning itself up, and so you can try again.
So I think the answer is 'ALL OF THE ABOVE!', apart from LET IT CRASH. You take pride in your work, right?
Don't fall into the 'there's loads of memory so it probably won't happen' trap. If every app writer took that attitude you'd see OOM far more often. And not all apps run on desktop machines; take a mobile phone, for example: you're highly likely to run into OOM on a RAM-starved platform like that, trust me!
If all else fails display a useful message (assuming there's enough memory for a MessageBox!)

Random Complete System Unresponsiveness Running Mathematical Functions

I have a program that loads a file (anywhere from 10MB to 5GB) a chunk at a time (ReadFile), and for each chunk performs a set of mathematical operations (basically calculates the hash).
After calculating the hash, it stores info about the chunk in an STL map (basically <chunkID, hash>) and then writes the chunk itself to another file (WriteFile).
That's all it does. Yet this program will cause certain PCs to choke and die. The mouse begins to stutter, Task Manager takes more than two minutes to show, Ctrl+Alt+Del is unresponsive, running programs are slow... the works.
I've done literally everything I can think of to optimize the program, and have triple-checked all objects.
What I've done:
Tried different (less intensive) hashing algorithms.
Switched all allocations to nedmalloc instead of the default new operator
Switched from std::map to unordered_set, found the performance to still be abysmal, so I switched again to Google's dense_hash_map.
Converted all objects to store pointers to objects instead of the objects themselves.
Cached all read and write operations. Instead of reading a 16k chunk of the file and performing the math on it, I read 4MB into a buffer and read 16k chunks from there instead. Same for all write operations: they are coalesced into 4MB blocks before being written to disk.
Ran extensive profiling with Visual Studio 2010, AMD CodeAnalyst, and perfmon.
Set the thread priority to THREAD_MODE_BACKGROUND_BEGIN
Set the thread priority to THREAD_PRIORITY_IDLE
Added a Sleep(100) call after every loop.
Even after all this, the application still results in a system-wide hang on certain machines under certain circumstances.
Perfmon and Process Explorer show minimal CPU usage (with the sleep), no constant reads/writes from disk, few hard page faults (and only ~30k page faults over the lifetime of the application on a 5GB input file), little virtual memory (never more than 150MB), no leaked handles, no memory leaks.
The machines I've tested it on run Windows XP through Windows 7, x86 and x64 versions included. None has less than 2GB RAM, though the problem is always exacerbated under lower memory conditions.
I'm at a loss as to what to do next. I don't know what's causing it. I'm torn between CPU and memory as the culprit: CPU, because without the sleep and under different thread priorities the system performance changes noticeably; memory, because there's a huge difference in how often the issue occurs when using unordered_set vs Google's dense_hash_map.
What's really weird? The NT kernel design is supposed to prevent this sort of behavior from ever occurring (a user-mode application driving the system to this sort of extreme poor performance!?)... yet when I compile the code and run it on OS X or Linux (it's fairly standard C++ throughout), it performs excellently even on poor machines with little RAM and weaker CPUs.
What am I supposed to do next? How do I know what the hell it is that Windows is doing behind the scenes that's killing system performance, when all the indicators are that the application itself isn't doing anything extreme?
Any advice would be most welcome.
I know you said you had monitored memory usage and that it seems minimal here, but the symptoms sound very much like the OS thrashing like crazy, which would definitely cause general loss of OS responsiveness like you're seeing.
When you run the application on a file say 1/4 to 1/2 the size of available physical memory, does it seem to work better?
What I suspect may be happening is that Windows is "helpfully" caching your disk reads into memory and not giving up that cache memory to your application for use, forcing it to go to swap. Thus, even though swap use is minimal (150MB), it's going in and out constantly as you calculate the hash. This then brings the system to its knees.
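One cheap experiment along those lines is to open the input with a cache hint, or bypass the cache entirely (a sketch; the file name is a placeholder and error handling is minimal):

#include <windows.h>

int main()
{
    // FILE_FLAG_SEQUENTIAL_SCAN hints that the file is read once, front to
    // back, so the cache manager can recycle pages behind the read cursor.
    // FILE_FLAG_NO_BUFFERING would bypass the cache entirely, but then every
    // read must be sector-aligned and a multiple of the sector size.
    HANDLE h = CreateFileA("input.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    // ... ReadFile loop as before ...
    CloseHandle(h);
    return 0;
}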
Some things to check:
Antivirus software. These often scan files as they're opened to check for viruses. Does your delay occur before any data is read by the application?
General system performance. Does copying the file using Explorer also show this problem?
Your code. Break it down into the various stages. Write a program that just reads the file, then one that reads and writes the files, then one that just hashes random blocks of RAM (i.e. remove the disk I/O part), and see if any particular step is problematic. If you can get a profiler, use it as well to see if there are any slow spots in your code.
EDIT
More ideas: perhaps your program is holding on to the GDI lock too much. This would explain everything else being slow without high CPU usage. Only one app at a time can hold the GDI lock. Is this a GUI app, or just a simple console app?
You also mentioned RtlEnterCriticalSection. This is a costly operation, and it can hang the system quite easily, i.e. through mismatched Enters and Leaves. Are you multi-threading at all? Is the slowdown due to race conditions between threads?
XPerf is your guide here - watch the PDC Video about it, and then take a trace of the misbehaving app. It will tell you exactly what's happening throughout the system, it is extremely powerful.
I like the disk-caching/thrashing suggestions, but if that's not it, here are some scattershot suggestions:
What non-MSVC libraries, if any, are you linking to?
Can your program be modified (#ifdef'd) to run without a GUI? Does the problem occur?
You added ::Sleep(100) after each loop in each thread, right? How many threads are you talking about? A handful or hundreds? How long does each loop take, roughly? What happens if you make that ::Sleep(10000)?
Is your program perhaps doing something else that locks a limited resource? (ProcExp can show you what handles are being acquired... of course you might have difficulty with ProcExp not responding :-[)
Are you sure critical sections are userland-only? I recall that was so back when I worked on Windows (or so I believed), but Microsoft could have modified that. I don't see any guarantee in the MSDN article Critical Section Objects (http://msdn.microsoft.com/en-us/library/ms682530%28VS.85%29.aspx)... and this leads me to wonder: Anti-convoy locks in Windows Server 2003 SP1 and Windows Vista.
Hmmm... presumably we're all multi-processor now, so are you setting the spin count on the CS? (See the sketch after this list.)
How about running a debugging version of one of these OSes and monitoring the kernel debugging output (using DbgView)... possibly using the kernel debugger from the Platform SDK ... if MS still calls it that?
I wonder whether VMMap (another SysInternal/MS utility) might help with the Disk caching hypothesis.
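A minimal sketch of the spin-count suggestion (4000 is just a commonly cited starting value, not a measured recommendation):

#include <windows.h>

CRITICAL_SECTION g_cs;

int main()
{
    // On a multi-processor machine the spin count lets a contending thread
    // busy-wait briefly before falling back to an expensive kernel wait.
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);

    EnterCriticalSection(&g_cs);
    // ... protected work ...
    LeaveCriticalSection(&g_cs);

    DeleteCriticalSection(&g_cs);
    return 0;
}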
It turns out that this is a bug in the Visual Studio compiler. Using a different compiler resolves the issue entirely.
In my case, I installed and used the Intel C++ Compiler, and even with all optimizations disabled I did not see the full-system hang that I was experiencing with the Visual Studio 2005 - 2010 compilers on this library.
I'm not certain as to what is causing the compiler to generate such broken code, but it looks like we'll be buying a copy of the Intel compiler.
It sounds like you're poking around fixing things without knowing what the problem is. Take stackshots. They will tell you what your program is doing when the problem occurs. It might not be easy to get stackshots if the problem occurs on other machines where you cannot use an IDE or a stack sampler. One possibility is to kill the app and get a stack dump while it's acting up. You need to reproduce the problem in an environment where you can get a stack dump.
Added: You say it performs well on OS X and Linux, and poorly on Windows. I assume the ratio of completion times is some fairly large number, like 10 or 100, if you've even had the patience to wait for it. I said this in a comment, but it is a key point: the program is waiting for something, and you need to find out what. It could be any of the things people have mentioned, but it is not random.
Every program, all the time while it runs, has a call stack consisting of a hierarchy of call instructions at specific addresses. If at a point in time it is calculating, the last instruction on the stack is a non-call instruction. If it is in I/O the stack may reach into a few levels of library calls that you can't see into. That's OK. Every call instruction on the stack is waiting. It is waiting for the work it requested to finish. If you look at the call stack, and look at where the call instructions are in your code, you will know what your program is waiting for.
Your program, since it is taking so long to complete, is spending nearly all of its time waiting for something to finish, and as I said, that's what you need to find out. Get a stack dump while it's being slow, and it will give you the answer. The chance that it will miss it is 1/the-slowness-ratio.
Sorry to be so elemental about this, but lots of people (and profiler makers) don't get it. They think they have to measure.

Tracing memory corruption on a production linux server

Guys, could you please recommend a tool for spotting memory corruption on a production multithreaded server built with C++ and running under Linux x86_64? I'm currently facing the following problem: every several hours my server crashes with a segfault, and the core dump shows that the error happens in malloc/calloc, which is a definite sign of memory being corrupted somewhere.
Actually I have already tried some tools without much luck. Here is my experience so far:
Valgrind is a great (I'd even say the best) tool, but it slows the server down too much, making it unusable in production. I tried it on a stage server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my stage server under Valgrind for several hours but still couldn't spot any serious errors.
ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the stage server, in random weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't support threading well? I have no idea.
DUMA: same story as ElectricFence, but even worse. While EF produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is definitely built with the -g flag).
dmalloc: I configured the server to use it instead of the standard malloc routines, but it hangs after several minutes. Attaching gdb to the process reveals it's hung somewhere inside dmalloc :(
I'm gradually going crazy and simply don't know what to do next. I still have the following tools to try: mtrace, mpatrol. But maybe someone has a better idea?
I'd greatly appreciate any help on this issue.
Update: I managed to find the source of the bug. However, I found it on the stage server, not the production one, using Helgrind/DRD/ThreadSanitizer: there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without significant slowdowns...
Yes, C/C++ memory corruption problems are tough.
I have also used Valgrind several times; sometimes it revealed the problem and sometimes not.
While examining Valgrind output, don't be tempted to dismiss its results too quickly. Sometimes, after considerable time spent, you'll see that Valgrind gave you the clue in the first place, but you ignored it.
Another piece of advice is to compare the code changes against the previously known stable release. It's not a problem if you use some sort of version control system (e.g. svn). Examine all memory-related functions (e.g. memcpy, memset, sprintf, new, delete/delete[]).
Compile your program with gcc 4.1 and the -fstack-protector-all switch. If the memory corruption is caused by stack smashing, this should be able to detect it. You might need to play with some of the additional SSP parameters.
Have you tried -fmudflap?
You can try IBM Purify, but I'm afraid it is not open source.
Google Perftools, which is open source, may be of help; see the heap-checker documentation.
Try this one:
http://www.hexco.de/rmdebug/
I used it extensively. It has a low performance impact (it mostly costs RAM), but the allocation algorithm is the same. It has always proven enough to find any allocation bug: your program will crash as soon as the bug occurs, and it will leave a detailed log.
I'm not sure if it would have caught your particular bug, but the MALLOC_CHECK_ environment variable (see the malloc man page) turns on additional checking in the default Linux malloc implementation, and it typically doesn't have a significant runtime cost.
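If you'd rather enable this from inside the process than via the environment, glibc exposes roughly the same knob through mallopt (a sketch, assuming glibc):

#include <malloc.h>

int main()
{
    // Roughly equivalent to exporting MALLOC_CHECK_=3 before launch: print
    // a diagnostic and abort as soon as glibc detects heap inconsistency.
    mallopt(M_CHECK_ACTION, 3);
    // ... rest of the server ...
    return 0;
}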

Heap corruption under Win32; how to locate?

I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools for locating this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced: there are a lot of code changes in the repository.
The prompt for crashing behaviour is to generate throughput in this system: socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to throw an exception (various places, various causes, including heap alloc failing; thus: heap corruption).
The behaviour seems related to CPU power or memory bandwidth: the more of each the machine has, the easier it is to crash. Disabling a hyper-threaded core or one core of a dual-core reduces the rate of (but does not eliminate) the corruption. This suggests a timing-related issue.
Now here's the rub:
When it's run under a lightweight debug environment (say Visual Studio 98, a.k.a. MSVC6), the heap corruption is reasonably easy to reproduce: ten or fifteen minutes pass before something fails horrendously and throws an exception, like a failed alloc. When running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9, or even Microsoft Application Verifier), the system becomes memory-speed bound and doesn't crash (memory-bound: the CPU is not getting above 50%, the disk light is not on, the program's going as fast as it can, the box is consuming 1.3GB of its 2GB of RAM). So I have a choice between being able to reproduce the problem (but not identify the cause) and being able to identify the cause of a problem I can't reproduce.
My current best guesses as to where to next is:
Get an insanely grunty box (to replace the current dev box: 2GB RAM in an E6550 Core2 Duo); this will make it possible to reproduce the crash-causing misbehaviour when running under a powerful debug environment; or
Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.
And, no: Shipping with Purify instrumentation built in is not an option.
A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"
And now, the question: How do I locate the heap corruptor?
Update: balancing new[] and delete[] seems to have gone a long way towards solving the problem. Instead of 15 minutes, the app now runs about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.
Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.
Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
I'll take note of that, but I'm concerned that Dr. Watson will only be tripped up after the fact, not while the heap is getting stomped on.
Another option might be WinDbg as a debugging tool; it is quite powerful while also being lightweight.
Got that going at the moment, again: not much help until something goes wrong. I want to catch the vandal in the act.
Maybe these tools will allow you at least to narrow the problem to certain component.
I don't hold much hope, but desperate times call for...
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?
No I'm not, and I'll spend a couple of hours tomorrow going through the workspace (58 projects in it) and checking they're all compiling and linking with the appropriate flags.
Update: This took 30 seconds. Select all projects in the Settings dialog, unselect until you find the project(s) that don't have the right settings (they all had the right settings).
My first choice would be a dedicated heap tool such as pageheap.exe.
Rewriting new and delete might be useful, but that doesn't catch the allocs committed by lower-level code. If this is what you want, better to Detour the low-level alloc APIs using Microsoft Detours.
Also run sanity checks such as: verify that your run-time libraries match (release vs. debug, multi-threaded vs. single-threaded, dll vs. static lib), look for bad deletes (e.g., delete where delete[] should have been used), and make sure you're not mixing and matching your allocs.
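For concreteness, the delete/delete[] mismatch referred to above (a classic heap corrupter):

int main()
{
    int* a = new int[16];
    // delete a;    // WRONG: scalar delete of an array allocation is
                    //        undefined behavior and can corrupt the heap
    delete[] a;     // correct: the array form matches new[]
    return 0;
}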
Also try selectively turning off threads and see when/if the problem goes away.
What does the call stack etc look like at the time of the first exception?
I have the same problems in my work (we also use VC6 sometimes), and there is no easy solution for it. I have only some hints:
Try automatic crash dumps on the production machine (see Process Dumper). My experience says Dr. Watson is not perfect for dumping.
Remove all catch(...) from your code. They often hide serious memory exceptions.
Check Advanced Windows Debugging - there are lots of great tips for problems like yours. I recommend it with all my heart.
If you use STL, try STLport and checked builds. Invalid iterators are hell.
Good luck. Problems like yours take us months to solve. Be ready for this...
We've had pretty good luck writing our own malloc and free functions. In production they just call the standard malloc and free, but in debug they can do whatever you want. We also have a simple base class that does nothing but override the new and delete operators to use these functions; any class you write can then simply inherit from that class. If you have a ton of code, it may be a big job to replace calls to malloc and free with the new malloc and free (don't forget realloc!), but in the long run it's very helpful.
In Steve Maguire's book Writing Solid Code (highly recommended), there are examples of debug stuff that you can do in these routines, like:
Keep track of allocations to find leaks
Allocate more memory than necessary and put markers at the beginning and end of memory -- during the free routine, you can ensure these markers are still there
memset the memory with a marker on allocation (to find use of uninitialized memory) and again on free (to find use of freed memory); both are sketched below
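A hypothetical sketch of the last two items, in the spirit of the custom malloc/free described above (names, marker values, and pad sizes are illustrative):

#include <cassert>
#include <cstdlib>
#include <cstring>

static const unsigned char kMarker = 0xFD;  // sentinel byte
static const size_t kPad = 16;              // sentinel size; keeps alignment

void* dbg_malloc(size_t size)
{
    // Layout: [kPad header: size + front sentinel][user data][kPad back sentinel]
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(kPad + size + kPad));
    if (!raw) return NULL;
    std::memcpy(raw, &size, sizeof size);                         // stash the size
    std::memset(raw + sizeof size, kMarker, kPad - sizeof size);  // front sentinel
    std::memset(raw + kPad + size, kMarker, kPad);                // back sentinel
    return raw + kPad;
}

void dbg_free(void* p)
{
    if (!p) return;
    unsigned char* user = static_cast<unsigned char*>(p);
    unsigned char* raw = user - kPad;
    size_t size;
    std::memcpy(&size, raw, sizeof size);
    for (size_t i = sizeof size; i < kPad; ++i)
        assert(raw[i] == kMarker && "buffer underrun detected");
    for (size_t i = 0; i < kPad; ++i)
        assert(user[size + i] == kMarker && "buffer overrun detected");
    std::memset(user, 0xDD, size);  // poison, so use-after-free is visible
    std::free(raw);
}

// Base class whose heirs are allocated through the checking routines.
struct DebugAlloc {
    void* operator new(size_t s) { return dbg_malloc(s); }
    void operator delete(void* p) { dbg_free(p); }
};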
Another good idea is to never use things like strcpy, strcat, or sprintf; always use strncpy, strncat, and snprintf. We've written our own versions of these as well, to make sure we don't write off the end of a buffer, and these have caught lots of problems too.
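A sketch of what such a wrapper might look like (safe_strcpy is a hypothetical name, not from any library):

#include <cassert>
#include <cstring>

// Bounded copy that guarantees NUL-termination and traps on truncation.
char* safe_strcpy(char* dst, size_t dst_size, const char* src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    size_t n = std::strlen(src);
    assert(n < dst_size && "safe_strcpy: destination too small");
    std::memcpy(dst, src, n + 1);  // +1 copies the terminating NUL
    return dst;
}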
Run the original application with ADPlus: adplus -crash -pn appname.exe
When the memory issue pops up, you will get a nice big dump.
You can analyze the dump to figure out what memory location was corrupted.
If you are lucky, the overwriting memory is a unique string and you can figure out where it came from. If you are not lucky, you will need to dig into the Win32 heap and figure out what the original memory characteristics were (!heap -x might help).
After you know what was messed up, you can narrow AppVerifier usage with special heap settings, i.e. specify which DLL to monitor, or which allocation size to monitor.
Hopefully this will speed up the monitoring enough to catch the culprit.
In my experience, I never needed full heap verifier mode, but I spent a lot of time analyzing the crash dump(s) and browsing sources.
P.S:
You can use DebugDiag to analyze the dumps.
It can point out the DLL owning the corrupted heap, and give you other useful details.
You should attack this problem with both runtime and static analysis.
For static analysis, consider compiling with PREfast (cl.exe /analyze). It detects mismatched delete and delete[], buffer overruns, and a host of other problems. Be prepared, though, to wade through many kilobytes of L6 warnings, especially if your project still has L4 warnings unfixed.
PREfast is available with Visual Studio Team System and, apparently, as part of Windows SDK.
Is this happening in low-memory conditions? If so, it might be that new is returning NULL rather than throwing std::bad_alloc; older VC++ compilers didn't properly implement this. There is an article about legacy memory allocation failures crashing STL apps built with VC6.
The apparent randomness of the memory corruption sounds very much like a thread synchronization issue: the bug reproduces depending on machine speed. If objects (chunks of memory) are shared among threads and the synchronization primitives (critical section, mutex, semaphore, other) are not on a per-object basis, then it is possible to end up in a situation where a chunk of memory is deleted / freed while in use, or used after being deleted / freed.
As a test for that, you could add synchronization primitives to each class and method. This will make your code slower, because many objects will have to wait for each other, but if it eliminates the heap corruption, your heap-corruption problem will become a code-optimization one.
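A minimal sketch of that experiment (the class and method body are placeholders for your own):

#include <windows.h>

// Give each object its own lock and take it in every method; if the
// corruption disappears, a data race was the likely culprit.
class Guarded {
public:
    Guarded()  { InitializeCriticalSection(&cs_); }
    ~Guarded() { DeleteCriticalSection(&cs_); }

    void Method()
    {
        EnterCriticalSection(&cs_);
        // ... original method body ...
        LeaveCriticalSection(&cs_);
    }

private:
    CRITICAL_SECTION cs_;
};

int main()
{
    Guarded g;
    g.Method();
    return 0;
}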
You tried old builds, but is there a reason you can't keep going further back in the repository history and seeing exactly when the bug was introduced?
Otherwise, I would suggest adding simple logging of some kind to help track down the problem, though I am at a loss of what specifically you might want to log.
If you can find out what exactly CAN cause this problem, via Google and the documentation for the exceptions you are getting, maybe that will give further insight into what to look for in the code.
My first action would be as follows:
Build the binaries in "Release" configuration, but create the debug info files (you will find this option in the project settings).
Use Dr. Watson as the default debugger (drwtsn32 -i) on the machine on which you want to reproduce the problem.
Reproduce the problem. Dr. Watson will produce a dump that might be helpful in further analysis.
Another option might be WinDbg as a debugging tool; it is quite powerful while also being lightweight.
Maybe these tools will allow you at least to narrow the problem to certain component.
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?
So from the limited information you have, this can be a combination of one or more things:
Bad heap usage, i.e., double frees, reads after free, writes after free, setting the HEAP_NO_SERIALIZE flag while allocating and freeing from multiple threads on the same heap
Out of memory
Bad code (i.e., buffer overflows, buffer underflows, etc.)
"Timing" issues
If it's either of the first two, but not the last, you should have caught it by now with pageheap.exe.
Which most likely means it is due to how the code is accessing shared memory. Unfortunately, tracking that down is going to be rather painful. Unsynchronized access to shared memory often manifests as weird "timing" issues: things like not using acquire/release semantics when synchronizing access to shared memory with a flag, not using locks appropriately, etc.
At the very least, it would help to be able to track allocations somehow, as was suggested earlier. At least then you can view what actually happened up until the heap corruption and attempt to diagnose from that.
Also, if you can easily redirect allocations to multiple heaps, you might want to try that to see if it either fixes the problem or results in more reproducible buggy behavior.
When you were testing with VS2008, did you run with HeapVerifier with Conserve Memory set to Yes? That might reduce the performance impact of the heap allocator. (Plus, you have to run it with Debug -> Start with Application Verifier, but you may already know that.)
You can also try debugging with Windbg and various uses of the !heap command.
MSN
Graeme's suggestion of custom malloc/free is a good idea. See if you can characterize some pattern about the corruption to give you a handle to leverage.
For example, if it is always in a block of the same size (say 64 bytes), then change your malloc/free pair to always allocate 64-byte chunks in their own page. When you free a 64-byte chunk, set the memory protection bits on that page to prevent reads and writes (using VirtualProtect). Then anyone attempting to access this memory will generate an exception rather than corrupting the heap.
This does assume that the number of outstanding 64-byte chunks is only moderate, or that you have a lot of memory to burn in the box!
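A sketch of that guard-page idea (the function names are illustrative and error handling is omitted):

#include <windows.h>

// Each block gets its own page(s); "freeing" makes the page inaccessible
// instead of reusable, so any later touch faults at the guilty instruction.
void* GuardAlloc(size_t size)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    size_t bytes = ((size + si.dwPageSize - 1) / si.dwPageSize) * si.dwPageSize;
    return VirtualAlloc(NULL, bytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}

void GuardFree(void* p)
{
    DWORD oldProtect;
    // Keep the address range reserved but unreadable and unwritable.
    VirtualProtect(p, 1, PAGE_NOACCESS, &oldProtect);
}

int main()
{
    void* p = GuardAlloc(64);
    GuardFree(p);
    // *(char*)p = 0;  // would now raise an access violation, not corrupt the heap
    return 0;
}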
If you choose to rewrite new/delete, I have done this and have simple source code at:
http://gandolf.homelinux.org/~smhanov/blog/?id=10
This catches memory leaks and also inserts guard data before and after the memory block to capture heap corruption. You can integrate it just by putting #include "debug.h" at the top of every CPP file and defining DEBUG and DEBUG_MEM.
I had little time to solve a similar problem once. If the problem still exists, I suggest you do this:
Monitor all calls to new/delete and malloc/calloc/realloc/free.
I made a single DLL exporting a function for registering all the calls. This function receives parameters identifying your source location, a pointer to the allocated area, and the type of call, and saves this information in a table.
Every matched allocate/free pair is eliminated. At the end (or whenever you need), you call another function to create a report of the remaining entries.
With this you can identify mismatched calls (new/free or malloc/delete) or missing ones.
If there is any case of a buffer being overwritten in your code, the saved information can be wrong, but each test may detect/discover/include a solution for the identified failure. Many runs help identify the errors.
Good luck.
Do you think this is a race condition? Are multiple threads sharing one heap? Can you give each thread a private heap with HeapCreate? Then they can run fast with HEAP_NO_SERIALIZE. Otherwise, a heap should be thread-safe if you're using the multi-threaded version of the system libraries.
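A sketch of the per-thread-heap experiment (the worker body is a placeholder):

#include <windows.h>

// With HEAP_NO_SERIALIZE the heap skips its internal lock, which is safe
// here only because no other thread ever touches this heap.
DWORD WINAPI Worker(LPVOID)
{
    HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);  // 0, 0: grow on demand
    void* p = HeapAlloc(heap, 0, 1024);
    // ... thread-local work using p ...
    HeapFree(heap, HEAP_NO_SERIALIZE, p);
    HeapDestroy(heap);
    return 0;
}

int main()
{
    HANDLE t = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
    return 0;
}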
A couple of suggestions. You mention the copious warnings at W4; I would suggest taking the time to fix your code to compile cleanly at warning level 4. This will go a long way toward preventing subtle, hard-to-find bugs.
Second, for the /analyze switch: it does indeed generate copious warnings. To use this switch in my own project, I created a new header file that used #pragma warning to turn off all the additional warnings generated by /analyze; further down in the file, I turn on only those warnings I care about. Then use the /FI compiler switch to force this header file to be included first in all your compilation units. This should allow you to use the /analyze switch while controlling the output.
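A sketch of such a forced-include header (the warning numbers are placeholders, not a recommended set):

// analyze_warnings.h - forced into every translation unit via /FI
#pragma warning(disable: 6001 6011 6385 6386)  // mute /analyze noise wholesale

// ... further down, re-enable only the warnings you care about:
#pragma warning(default: 6011)  // e.g. dereferencing a possibly-NULL pointer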