I'm trying to understand how the objects (variables, functions, structs, etc.) work in C++. In this case I see there are basically two ways of storing them: the stack and the heap. Accordingly, whenever the heap storage is used it needs to be deallocated manually, but if the stack is used, then the deallocation is done automatically. So my question is about the kinds of problems that bad practice might cause for the program itself or for the computer. For example:
1.- Let's suppose that I run a program with a recursive solution that iterates through function calls infinitely. Theoretically the program crashes (stack overflow), but does it cause some trouble to the computer itself? (To the RAM maybe, or to the OS.)
2.- What happens if I forget to deallocate memory on the heap? I mean, does it just cause trouble for the program, or is it permanent for the computer in general? I mean, could that memory never be used again, or something like that?
3.- What are the problems of getting a segmentation fault (on the heap)?
Any other dangers or caveats relevant to this are welcome.
Accordingly, whenever the stack storage is used it needs to be deallocated manually, but if the heap is used, then the deallocation is done automatically.
When you use stack - local variables in the function - they are deallocated automatically when the function ends (returns).
When you allocate from the heap, the memory allocated remains "in use" until it is freed. If you don't do that, your program, if it runs for long enough and keeps allocating "stuff", will use all memory available to it, and eventually fail.
Note that "stackfault" is almost impossible to recover from in an application, because the stack is no longer usable when it's full, and most operations to "recover from error" will involve using SOME stack memory. The processor typically has a special trap to recover from stack fault, but that lands insise the operating system, and if the OS determines the application has run out of stack, it often shows no mercy at all - it just "kills" the application immediately.
1.- Let's suppose that I run a program with a recursive solution that iterates through function calls infinitely. Theoretically the program crashes (stack overflow), but does it cause some trouble to the computer itself? (To the RAM maybe, or to the OS.)
No, the computer itself is not harmed by this in and of itself. There may of course be data-loss if your program wasn't saving something that the user was working on.
Unless the hardware is very badly designed, it's very hard to write code that causes any harm to the computer, beyond loss of stored data (of course, if you write a program that fills the entire hard disk from the first to the last sector, your data will be overwritten with whatever your program fills the disk with - which may well cause the machine to not boot again until you have re-installed an operating system on the disk). But RAM and processors don't get damaged by bad coding (fortunately, as most programmers make mistakes now and again).
2.- What happens if I forget to deallocate memory on the heap? I mean, does it just cause trouble for the program, or is it permanent for the computer in general? I mean, could that memory never be used again, or something like that?
Once the program finishes (and most programs that use "too much memory" do terminate in some way or another, at some point), the memory is returned to the OS, so nothing is lost permanently.
Of course, how well the operating system and other applications handle "there is no memory at all available" varies a little bit. The operating system in itself is generally OK with it, but some drivers that are badly written may well crash, and thus cause your system to reboot if you are unlucky. Applications are more prone to crashing due to there not being enough memory, because allocations end up with NULL (zero) as the "returned address" when there is no memory available. Using address zero in a modern operating system will almost always lead to a "Segmentation fault" or similar problem (see below for more on that).
But these are extreme cases; most systems are set up such that one application gobbling all available memory will in itself fail before the rest of the system is impacted - not always, and it's certainly not guaranteed that the application "causing" the problem is the first one to be killed if the OS kills applications simply because they "eat a lot of memory". Linux does have an "Out of memory killer", which is a pretty drastic method to ensure the system can continue to work [by some definition of "work"].
3.- What are the problems of getting a segmentation fault (the heap).
Segmentation faults don't directly have anything to do with the heap. The term segmentation fault comes from older operating systems (Unix-style) that used "segments" of memory for different usages, and "segmentation fault" was when the program went outside its allocated segment. In modern systems, the memory is split into "pages" - typically 4KB each, but some processors have larger pages, and many modern processors support "large pages" of, for example, 2MB or 1GB, which are used for large chunks of memory.
Now, if you use an address that points to a page that isn't there (or isn't "yours"), you get a segmentation fault. This typically will end the application then and there. You can "trap" a segmentation fault, but in all operating systems I'm aware of, it's not valid to try to continue from this "trap" - but you could for example store away some files to explain what happened and help troubleshoot the problem later, etc.
Firstly, your understanding of stack/heap allocations is backwards: stack-allocated data is automatically reclaimed when it goes out of scope. Dynamically-allocated data (data allocated with new or malloc), which is generally heap-allocated data, must be manually reclaimed with delete/free. However, you can use C++ destructors (RAII) to automatically reclaim dynamically-allocated resources.
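For illustration, a minimal RAII sketch (the Buffer class here is just an example, not something from the question): the destructor releases the heap allocation, so it is reclaimed automatically when the object goes out of scope, even if an exception is thrown.

#include <cstddef>

class Buffer {
public:
    explicit Buffer(std::size_t n) : data_(new int[n]), size_(n) {}
    ~Buffer() { delete[] data_; }               // automatic cleanup

    Buffer(const Buffer&) = delete;             // forbid copies so the pointer isn't freed twice
    Buffer& operator=(const Buffer&) = delete;

    int* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    int* data_;
    std::size_t size_;
};

void useBuffer()
{
    Buffer buf(1024);        // heap allocation happens here
    buf.data()[0] = 42;
}                            // ~Buffer() runs here; no manual delete needed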
Secondly, the 3 questions you ask have nothing to do with the C++ language, but rather they are only answerable with respect to the environment/operating system you run a C++ program in. Modern operating systems generally isolate processes so that a misbehaving process doesn't trample over OS memory or other running programs. In Linux, for example, each process has its own address space which is allocated by the kernel. If a misbehaving process attempts to write to a memory address outside of its allocated address space, the operating system will send a SIGSEGV (segmentation fault) which usually aborts the process. Older operating systems, such as MS-DOS, didn't have this protection, and so writing to an invalid pointer or triggering a stack overflow could potentially crash the whole operating system.
Likewise, with most mainstream modern operating systems (Linux/UNIX/Windows, etc.), memory leaks (data which is allocated dynamically but never reclaimed) only affect the process which allocated them. When the process terminates, all memory allocated by the process is reclaimed by the OS. But again, this is a feature of the operating system, and has nothing to do with the C++ language. There may be some older operating systems where leaked memory is never reclaimed, even by the OS.
1.- Let's suppose that I run a program with a recursive solution that iterates through function calls infinitely. Theoretically the program crashes (stack overflow), but does it cause some trouble to the computer itself? (To the RAM maybe, or to the OS.)
A stack overflow should cause trouble neither to the operating system nor to the computer. Any modern OS provides an isolated address space to each process. When a process tries to allocate more data on its stack than there is space available, the OS detects it (usually via an exception) and terminates the process. This guarantees that no other processes are affected.
2.- What happens if I forget to deallocate memory on the heap? I mean, does it just cause trouble for the program, or is it permanent for the computer in general? I mean, could that memory never be used again, or something like that?
It depends on whether your program is a long-running process or not, and the amount of data that you're failing to deallocate. In a long-running process (e.g. a server) a recurrent memory leak can lead to thrashing: after some time, your process will be using so much memory that it won't fit in your physical memory. This is not a problem per se, because the OS provides virtual memory, but the OS will spend more time moving memory pages from your physical memory to disk than doing useful work. This can affect other processes and it might slow down the system significantly (to the point that it might be better to reboot it).
3.- What are the problems of getting a segmentation fault (the heap).
A segmentation fault will crash your process. It's not directly related to the usage of the heap, but rather to accessing a memory region that does not belong to your process (because it's not part of its address space, or because it was, but it was freed). Depending on what your process was doing, this can cause other problems: for instance, if the process was writing to a file when the crash happened, it's very likely that the file will end up corrupted.
First, stack means automatic memory and heap means manual memory. There are ways around both, but that's generally a more advanced question.
On modern operating systems, your application will crash but the operating system and machine as a whole will continue to function. There are of course exceptions to this rule, but they're (again) a more advanced topic.
Allocating from the heap and then not deallocating when you're done just means that your program is still considered to be using the memory even though you're not using it. If left unchecked, your program will fail to allocate memory (out of memory errors). How you handle out-of-memory errors could mean anything from a crash (unhandled error resulting in an unhandled exception or a NULL pointer being accessed and generating a segmentation fault) to odd behavior (caught exception or tested for NULL pointer but no special handling case) to nothing at all (properly handled).
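For illustration, a small sketch of the two usual ways an allocation failure surfaces in C++ and can be handled properly (both variants are standard behaviour):

#include <iostream>
#include <new>

int main()
{
    // Variant 1: nothrow new returns a null pointer on failure instead of throwing.
    int* p = new (std::nothrow) int[1000];
    if (p == nullptr) {
        std::cerr << "allocation failed\n";   // handled: no crash
    } else {
        delete[] p;
    }

    // Variant 2: ordinary new reports failure by throwing std::bad_alloc.
    try {
        int* q = new int[1000];
        delete[] q;
    } catch (const std::bad_alloc& e) {
        std::cerr << "allocation failed: " << e.what() << '\n';
    }
}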
On modern operating systems, the memory will be freed when your application exits.
A segmentation fault in the normal sense will simply crash your application. The operating system may immediately close file or socket handles. It may also perform a dump of your application's memory so that you can try to debug it posthumously with tools designed to do that (more advanced subject).
Alternatively, most (I think?) modern operating systems will use a special method of telling the program that it has done something bad. It is then up to the program's code to decide whether or not it can recover from that or perhaps add additional debug information for the operating system, or whatever really.
I suggest you look into auto pointers (also called smart pointers) for making your heap behave a little bit like a stack -- automatically deallocating memory when you're done using it. If you're using a modern compiler, see std::unique_ptr. If that type name can't be found, then look into the boost library (google). It's a little more advanced but highly valuable knowledge.
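For example, a minimal sketch using std::unique_ptr (assuming a C++11 compiler):

#include <memory>

struct Widget { int value = 0; };

void demo()
{
    // The unique_ptr owns the heap allocation; its destructor calls delete
    // automatically when it goes out of scope, even on early return or exception.
    std::unique_ptr<Widget> w(new Widget);   // or std::make_unique<Widget>() in C++14
    w->value = 42;
}                                            // Widget freed here, no explicit delete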
Hope this helps.
My question has two parts:
Is it possible that, if a segfault occurs after allocating memory but before freeing it, this leaks memory (that is, the memory is never freed resulting in a memory leak)?
If so, is there any way to ensure allocated memory is cleaned up in the event of a segfault?
I've been reading about memory management in C++ but was unable to find anything about my specific question.
In the event of a seg fault, the OS is responsible for cleaning up all the resources held by your program.
Edit:
Modern operating systems will clean up any leaked memory regardless of how your program terminates. The memory is only leaked for the life of your program. Most OS's will also clean up many other types of resources such as open files and socket connections.
Is it possible that, if a segfault occurs after allocating memory but
before freeing it, this leaks memory (that is, the memory is never
freed resulting in a memory leak)?
Yes and no: the process which crashes should be tidied up completely by the OS. However, consider other processes spawned by your process: they might not get terminated completely. Usually these shouldn't take too many resources at all, but this may vary depending on your program. See http://en.wikipedia.org/wiki/Zombie_process
If so, is there any way to ensure allocated memory is cleaned up in
the event of a segfault?
In case the program is non critical (meaning there are no lives at stake if it crashes) I suggest fixing the segmentation fault. If you really need to be able to handle segmentation faults see the answer on this topic: How to catch segmentation fault in Linux?
UPDATE: Please note that despite the fact that it is possible to handle SIGSEGV signals (and continue in the program flow), it is not a safe thing to rely on, since - as pointed out in the comments below - it is undefined behaviour, meaning different platforms/compilers/... may react differently.
So fixing segmentation faults (as well as access violations on Windows) should have first priority by any means possible. If you still handle signals this way using the suggested solution, it must be thoroughly tested, and if it goes into production code you must be aware of it and accept the consequences - which may vary and depend on your requirements, so I will not name any.
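For illustration, a minimal sketch of such a handler, assuming a POSIX system: it only records the fault with async-signal-safe calls and exits, since trying to continue from a SIGSEGV handler is undefined behaviour.

#include <signal.h>
#include <unistd.h>

// Log a short message and terminate. Only async-signal-safe functions
// (write, _exit) are used inside the handler.
void segvHandler(int)
{
    const char msg[] = "fatal: segmentation fault\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

int main()
{
    struct sigaction sa = {};
    sa.sa_handler = segvHandler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);

    // ... rest of the program ...
}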
The C++ standard does not concern itself with seg-faults (that's a platform-specific thing).
In practice, it really depends on what you do, and what your definition of "memory leak" is. In theory, you can register a handler for the seg-fault signal, in which you can do all necessary cleanup. However, any modern OS will automatically clear up a terminating process anyway.
One, there are resources the system is responsible for cleaning up. One of them is memory. You do not have to worry about permanently leaked RAM on a segfault.
Two, there are resources that the system is not responsible for cleaning up. You could write a program that inserts its pid into a database and removes it on close. That won't be removed on a segfault. You can either 1) add a handler to clean up that sort of non-system resources, or 2) fix the bugs in your program to begin with.
Modern operating systems separate applications' memory as to be able to clean up after them. The problem with segmentation faults is that they normally only occur when something goes wrong. At this point, the default behavior is to shut down the application because it's no longer functioning as expected.
Likewise, unless under some bizarre circumstance, your application has likely done something you cannot account for if it has hit a segmentation fault. So, "cleaning up" is likely practically impossible. It's not out of the realm of possibility to use memory carefully similar to transactional databases in order to guarantee an acceptable roll-back state, however doing so (and depending on how fine-grained of a level) may be beyond tedious.
A more practical version may be to provide your own sort of sandboxing between application components and reboot a component if it dies, restoring it to an acceptable previous saved state wholesale. That way you can flush all of its allocated memory and have it start from scratch. You still lose whatever data it hadn't saved as of the last checkpoint, however.
I'm writing a caching app that consumes large amounts of memory. Hopefully, I'll manage my memory well enough, but I'm just thinking about what to do if I do run out of memory.
If a call to allocate even a simple object fails, is it likely that even a syslog call
will also fail?
EDIT: Ok perhaps I should clarify the question. If malloc or new returns a NULL or 0L value then it essentially means the call failed and it can't give you the memory for some reason. So, what would be the sensible thing to do in that case?
EDIT2: I've just realised that a call to "new" can throw an exception. This could be caught at a higher level, so I can perhaps gracefully exit further up. At that point, it may even be possible to recover, depending on how much memory is freed. At the very least I should hopefully be able to log something by that point. So while I have seen code that checks the value of a pointer after new, it is unnecessary. In C, however, you should check the return value of malloc.
Well, if you are in a case where there is a failure to allocate memory, you're going to get a std::bad_alloc exception. The exception causes the stack of your program to be unwound. In all likelihood, the inner loops of your application logic are not going to be handling out-of-memory conditions; only higher levels of your application should be doing that. Because the stack is getting unwound, a significant chunk of memory is going to be freed - which in fact should be almost all the memory used by your program.
The one exception to this is when you ask for a very large (several hundred MB, for example) chunk of memory which cannot be satisfied. When this happens though, there's usually enough smaller chunks of memory remaining which will allow you to gracefully handle the failure.
Stack unwinding is your friend ;)
EDIT: Just realized that the question was also tagged with C -- if that is the case, then you should be having your functions free their internal structures manually when out of memory conditions are found; not to do so is a memory leak.
EDIT2: Example:
#include <iostream>
#include <vector>

void DoStuff()
{
    std::vector<int> data;
    //insert a whole crapload of stuff into data here.
    //Assume std::vector::push_back does the actual throwing
    //i.e. data.resize(SOME_LARGE_VALUE_HERE);
}

int main()
{
    try
    {
        DoStuff();
        return 0;
    }
    catch (const std::bad_alloc& ex)
    {   //Observe that the local variable `data` no longer exists here.
        std::cerr << "Oops. Looks like you need to use a 64 bit system (or "
                     "get a bigger hard disk) for that calculation!";
        return -1;
    }
}
EDIT3: Okay, according to commenters there are systems out there which do not follow the standard in this regard. On the other hand, on such systems, you're going to be SOL in any case, so I don't see why they merit discussion. But if you are on such a platform, it is something to keep in mind.
Doesn't this question make assumptions regarding overcommitted memory?
I.e., an out of memory situation might not be recoverable! Even if you have no memory left, calls to malloc and other allocators may still succeed until the program attempts to use the memory. Then, BAM!, some process gets killed by the kernel in order to satisfy memory load.
I don't have any specific experience on Linux, but I spent a lot of time working in video games on games consoles, where running out of memory is verboten, and on Windows-based tools.
On a modern OS, you're most likely to run out of address space. Running out of memory, as such, is basically impossible. So just allocate a large buffer, or buffers, on startup, in order to hold all the data you'll ever need, whilst leaving a small amount for the OS. Writing random junk to these regions would probably be a good idea in order to force the OS to actually assign the memory to your process. If your process survives this attempt to use every byte it's asked for, there's some kind of backing now reserved for all of this stuff, so now you're golden.
Write/steal your own memory manager, and direct it to allocate from these buffers. Then use it, consistently, in your app, or take advantage of gcc's --wrap option to forward calls from malloc and friends appropriately. If you use any libraries that can't be directed to call into your memory manager, junk them, because they'll just get in your way. Lack of overridable memory management calls is evidence of deeper-seated issues; you're better off without this particular component. (Note: even if you're using --wrap, trust me, this is still evidence of a problem! Life is too short to use those libraries that don't let you overload their memory management!)
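For illustration, a sketch of how GNU ld's --wrap option can route malloc into a preallocated arena (link with -Wl,--wrap=malloc; the deliberately dumb bump allocator below stands in for a real memory manager, and a real setup would also wrap free, calloc and realloc):

#include <cstddef>

// Every call to malloc in the program is redirected to __wrap_malloc;
// __real_malloc still reaches the original allocator if it is ever needed.
extern "C" void* __real_malloc(std::size_t size);   // provided by the linker

namespace {
    char        g_buffer[64 * 1024 * 1024];          // preallocated 64 MB arena
    std::size_t g_used = 0;
}

extern "C" void* __wrap_malloc(std::size_t size)
{
    size = (size + 15) & ~static_cast<std::size_t>(15);  // keep 16-byte alignment
    if (g_used + size > sizeof(g_buffer))
        return nullptr;                                   // arena exhausted
    void* p = g_buffer + g_used;
    g_used += size;
    return p;
}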
Once you run out of memory, OK, you're screwed, but you've still got that space you left free before, so if freeing up some of the memory you've asked for is too difficult you can (with care) call system calls to write a message to the system log and then terminate, or whatever. Just make sure to avoid calls to the C library, because they'll probably try to allocate some memory when you least expect it -- programmers who work with systems that have virtualised address spaces are notorious for this kind of thing -- and that's the very thing that has caused the problem in the first place.
This approach might sound like a pain in the arse. Well... it is. But it's straightforward, and it's worth putting in a bit of effort for that. I think there's a Kernighan and/or Ritchie quote about this.
If your application is likely to allocate large blocks of memory and risks hitting the per-process or VM limits, waiting until an allocation actually fails is a difficult situation from which to recover. By the time malloc returns NULL or new throws std::bad_alloc, things may be too far gone to reliably recover. Depending on your recovery strategy, many operations may still require heap allocations themselves, so you have to be extremely careful on which routines you can rely.
Another strategy you may wish to consider is to query the OS and monitor the available memory, proactively managing your allocations. This way you can avoid allocating a large block if you know it is likely to fail, and will thus have a better chance of recovery.
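For illustration, one possible sketch of such a check, assuming Linux and its sysinfo(2) call (the safety margin here is an arbitrary choice, not a recommendation):

#include <sys/sysinfo.h>   // Linux-specific
#include <cstdint>

// Ask the kernel how much physical memory is currently free and only attempt
// a big allocation if it comfortably fits.
bool bigAllocationLooksSafe(std::uint64_t wanted)
{
    struct sysinfo si = {};
    if (sysinfo(&si) != 0)
        return false;                       // query failed: be conservative

    std::uint64_t freeBytes =
        static_cast<std::uint64_t>(si.freeram) * si.mem_unit;

    return wanted < freeBytes / 2;          // keep a generous safety margin
}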
Also, depending on your memory usage patterns, using a custom allocator may give you better results than the standard built-in malloc. For example, certain allocation patterns can actually lead to memory fragmentation over time, so even though you have free memory, the available blocks in the heap arena may not have an available block of the right size. A good example of this is Firefox, which switched to jemalloc and saw a great increase in memory efficiency.
I don't think that capturing the failure of malloc or new will gain you much in your situation. Linux allocates large chunks of virtual pages in malloc by means of mmap. Because of this you may find yourself in a situation where you allocate much more virtual memory than you have (real + swap).
The program then will only fail much later with a segfault (SIGSEGV) when you write to the first page for which there isn't any place in swap. In theory you could test for such situations by writing a signal handler and then dirtying all pages that you allocate.
But usually this will not help much either, since your application will be in a very bad state long before that: constantly swapping, computing mechanically with your hard disk...
It's possible for writes to the syslog to fail in low memory conditions: there's no way to know that for every platform without looking at the source for the relevant functions. They could need dynamic memory to format strings that are passed in, for instance.
Long before you run out of memory, however, you'll start paging stuff to disk. And when that happens, you can forget any performance advantages from caching.
Personally, I'm convinced by the design behind Varnish: the operating system offers services to solve a lot of the relevant problems, and it makes sense to use those services (minor editing):
So what happens with Squid's elaborate memory management is that it gets into fights with the kernel's elaborate memory management ...
Squid creates a HTTP object in RAM and it gets used some times rapidly after creation. Then after some time it gets no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This, however, is done without Squid knowing about it. Squid still thinks that these http objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive. ...
After some time, Squid will also notice that these objects are unused, and it decides to move them to disk so the RAM can be used for more busy data. So Squid goes out, creates a file and then it writes the http objects to the file.
Here we switch to the high-speed camera: Squid calls write(2), the address it gives is a "virtual address" and the kernel has it marked as "not at home". ...
The kernel tries to find a free page, if there are none, it will take a little used page from somewhere, likely another little used Squid object, write it to the paging ... space on the disk (the "swap area") when that write completes, it will read from another place in the paging pool the data it "paged out" into the now unused RAM page, fix up the paging tables, and retry the instruction which failed. ...
So now Squid has the object in a page in RAM and written to the disk two places: one copy in the operating system's paging space and one copy in the filesystem. ...
Here is how Varnish does it:
Varnish allocates some virtual memory, and it tells the operating system to back this memory with space from a disk file. When it needs to send the object to a client, it simply refers to that piece of virtual memory and leaves the rest to the kernel.
If/when the kernel decides it needs to use RAM for something else, the page will get written to the backing file and the RAM page reused elsewhere.
When Varnish next time refers to the virtual memory, the operating system will find a RAM page, possibly freeing one, and read the contents in from the backing file.
And that's it. Varnish doesn't really try to control what is cached in RAM and what is not, the kernel has code and hardware support to do a good job at that, and it does a good job.
You may not need to write caching code at all.
As has been stated, exhausting memory means that all bets are off. IMHO the best method of handling this situation is to fail gracefully (as opposed to simply crashing!). Your cache could allocate a reasonable amount of memory on instantiation. The size of this memory would equate to an amount that, when freed, will allow the program to terminate reasonably. When your cache detects that memory is becoming low then it should release this memory and instigate a graceful shutdown.
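For illustration, a minimal sketch of that "reserve block" idea using the standard new-handler hook (the reserve size and the decision to exit are arbitrary choices for the sketch):

#include <new>
#include <cstdlib>
#include <iostream>

// Grab a modest reserve up front and release it inside the new-handler so the
// shutdown path still has memory to work with.
static char* g_reserve = nullptr;

void onOutOfMemory()
{
    delete[] g_reserve;            // give the emergency reserve back to the heap
    g_reserve = nullptr;
    std::cerr << "out of memory: shutting down gracefully\n";
    std::exit(EXIT_FAILURE);       // or unwind to a cleanup path instead of exiting
}

int main()
{
    g_reserve = new char[1 << 20]; // 1 MB reserve; the size is an arbitrary choice
    std::set_new_handler(onOutOfMemory);
    // ... normal cache operation ...
}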
I'm writing a caching app that consumes large amounts of memory. Hopefully, I'll manage my memory well enough, but I'm just thinking about what to do if I do run out of memory.
If you are writing a daemon which should run 24/7/365, then you should not use dynamic memory management: preallocate all the memory in advance and manage it using some slab allocator / memory pool mechanism. That will also protect you against heap fragmentation.
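For illustration, a minimal sketch of such a preallocated pool (real slab allocators also deal with alignment, multiple block sizes and thread safety):

#include <cstddef>
#include <vector>

// All memory is grabbed once at start-up; blocks are then handed out and
// returned via a free list, so the daemon never touches the system heap
// (and never fragments it) after initialization.
class BlockPool {
public:
    BlockPool(std::size_t blockSize, std::size_t blockCount)
        : storage_(blockSize * blockCount)
    {
        for (std::size_t i = 0; i < blockCount; ++i)
            freeList_.push_back(storage_.data() + i * blockSize);
    }

    void* allocate()
    {
        if (freeList_.empty()) return nullptr;   // pool exhausted: caller decides policy
        void* p = freeList_.back();
        freeList_.pop_back();
        return p;
    }

    void release(void* p) { freeList_.push_back(static_cast<char*>(p)); }

private:
    std::vector<char>  storage_;    // one big allocation made during initialization
    std::vector<char*> freeList_;
};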
If a call to allocate even a simple object fails, is it likely that even a syslog call will also fail?
It should not. This is partially the reason why syslog exists as a syscall: so that an application can report an error independently of its internal state.
If malloc or new returns a NULL or 0L value then it essentially means the call failed and it can't give you the memory for some reason. So, what would be the sensible thing to do in that case?
In these situations I generally try to handle the error condition properly, applying the general error-handling rules. If the error happens during initialization - terminate with an error, probably a configuration error. If the error happens during request processing - fail the request with an out-of-memory error.
For plain heap memory, malloc() returning 0 generally means:
that you have exhausted the heap and unless your application free some memory, further malloc()s wouldn't succeed.
the wrong allocation size: it is a quite common coding error to mix signed and unsigned types when calculating the block size. If the size mistakenly ends up negative and is passed to malloc() where a size_t is expected, it becomes a very large number.
So in some sense it is also not wrong to abort() to produce the core file, which can be analyzed later to see why malloc() returned 0. Though I prefer to (1) include the attempted allocation size in the error message and (2) try to proceed further. If the application crashes due to another memory problem down the road (*), it will produce a core file anyway.
(*) From my experience of making software with dynamic memory management resilient to malloc() errors, I have seen that malloc() often does not return 0 reliably: first attempts returning 0 are followed by a successful malloc() returning a valid pointer, but the first access to the pointed-to memory crashes the application. This is my experience on both Linux and HP-UX - and I have seen a similar pattern on Solaris 10 too. The behavior isn't unique to Linux. To my knowledge the only way to make an application 100% resilient to memory problems is to preallocate all memory in advance. And that is mandatory for mission-critical, safety, life-support and carrier-grade applications - they are not allowed dynamic memory management past the initialization phase.
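For illustration, a small sketch of the wrapper implied by points (1) and (2) a couple of paragraphs above (the name checkedMalloc is just an example):

#include <cstdio>
#include <cstdlib>

// Report the attempted size when malloc() fails, then abort() to get a core
// file for post-mortem analysis (or attempt to carry on instead).
void* checkedMalloc(std::size_t size)
{
    void* p = std::malloc(size);
    if (p == nullptr) {
        std::fprintf(stderr, "malloc(%zu) failed\n", size);
        std::abort();          // produces a core dump where permitted
    }
    return p;
}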
I don't know why many of the sensible answers are voted down. In most server environments, running out of memory means that you have a leak somewhere, and that it makes little sense to 'free some memory and try to go on'. The nature of C++ and especially the standard library is that it requires allocations all the time. If you are lucky, you might be able to free some memory and execute a clean shutdown, or at least emit a warning.
It is however far more likely that you won't be able to do a thing, unless the allocation that failed was a huge one, and there is still memory available for 'normal' things.
Dan Bernstein is one of the very few guys I know that can implement server software that operates in memory constrained situations.
For most of the rest of us, we should probably design our software that it leaves things in a useful state when it bails out because of an out of memory error.
Unless you are some kind of brain surgeon, there isn't a lot else to do.
Also, very often you won't even get a std::bad_alloc or something like that; you'll just get a pointer back from your malloc/new, and only die when you actually try to touch all of that memory. This can be prevented by turning off overcommit in the operating system, but still.
Don't count on being able to deal with the SIGSEGV when you touch memory that the kernel hoped you wouldn't. I'm not quite sure how this works on the Windows side of things, but I bet they do overcommit too.
All in all, this is not one of C++'s strong spots.
I have a program that implements several heuristic search algorithms and several domains, designed to experimentally evaluate the various algorithms. The program is written in C++, built using the GNU toolchain, and run on a 64-bit Ubuntu system. When I run my experiments, I use bash's ulimit command to limit the amount of virtual memory the process can use, so that my test system does not start swapping.
Certain algorithm/test instance combinations hit the memory limit I have defined. Most of the time, the program throws an std::bad_alloc exception, which is printed by the default handler, at which point the program terminates. Occasionally, rather than this happening, the program simply segfaults.
Why does my program occasionally segfault when out of memory, rather than reporting an unhandled std::bad_alloc and terminating?
One reason might be that by default Linux overcommits memory. Requesting memory from the kernel appears to work alright, but later on when you actually start using the memory the kernel notices "Oh crap, I'm running out of memory", invokes the out-of-memory (OOM) killer which selects some victim process and kills it.
For a description of this behavior, see http://lwn.net/Articles/104185/
It could be some code using no-throw new and not checking the return value.
Or some code could be catching the exception and not handling it or rethrowing it.
What janneb said. In fact Linux by default never throws std::bad_alloc (or returns NULL from malloc()).
I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools to locate this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced - there are a lot of code changes in the repository.
The prompt for crashing behaviour is to generate throughput in this system - socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to exception (various places, various causes - including heap alloc failing, thus: heap corruption).
The behaviour seems related to CPU power or memory bandwidth; the more of each the machine has, the easier it is to crash. Disabling a hyper-threading core or a dual-core core reduces the rate of (but does not eliminate) corruption. This suggests a timing related issue.
Now here's the rub:
When it's run under a lightweight debug environment (say Visual Studio 98 / AKA MSVC6) the heap corruption is reasonably easy to reproduce - ten or fifteen minutes pass before something fails horrendously and throws an exception, like an alloc failing; when running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9 or even Microsoft Application Verifier) the system becomes memory-speed bound and doesn't crash (memory-bound: CPU is not getting above 50%, disk light is not on, the program's going as fast as it can, box consuming 1.3G of 2G of RAM). So, I've got a choice between being able to reproduce the problem (but not identify the cause) or being able to identify the cause of a problem I can't reproduce.
My current best guesses as to where to next is:
Get an insanely grunty box (to replace the current dev box: 2Gb RAM in an E6550 Core2 Duo); this will make it possible to reproduce the crash-causing misbehaviour when running under a powerful debug environment; or
Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad-guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.
And, no: Shipping with Purify instrumentation built in is not an option.
A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"
And now, the question: How do I locate the heap corruptor?
Update: balancing new[] and delete[] seems to have gotten a long way towards solving the problem. Instead of 15mins, the app now goes about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.
Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.
Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
I'll take a note of that, but I'm concerned that Dr Watson will only be tripped up after the fact, not when the heap is getting stomped on.
Another try might be using WinDebug as a debugging tool, which is quite powerful while also being lightweight.
Got that going at the moment, again: not much help until something goes wrong. I want to catch the vandal in the act.
Maybe these tools will allow you at least to narrow the problem to certain component.
I don't hold much hope, but desperate times call for...
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?
No I'm not, and I'll spend a couple of hours tomorrow going through the workspace (58 projects in it) and checking they're all compiling and linking with the appropriate flags.
Update: This took 30 seconds. Select all projects in the Settings dialog, unselect until you find the project(s) that don't have the right settings (they all had the right settings).
My first choice would be a dedicated heap tool such as pageheap.exe.
Rewriting new and delete might be useful, but that doesn't catch the allocs committed by lower-level code. If this is what you want, better to Detour the low-level alloc APIs using Microsoft Detours.
Also sanity checks such as: verify your run-time libraries match (release vs. debug, multi-threaded vs. single-threaded, dll vs. static lib), look for bad deletes (eg, delete where delete [] should have been used), make sure you're not mixing and matching your allocs.
Also try selectively turning off threads and see when/if the problem goes away.
What does the call stack etc look like at the time of the first exception?
I have the same problems in my work (we also use VC6 sometimes), and there is no easy solution for it. I have only some hints:
Try with automatic crash dumps on production machine (see Process Dumper). My experience says Dr. Watson is not perfect for dumping.
Remove all catch(...) from your code. They often hide serious memory exceptions.
Check Advanced Windows Debugging - there are lots of great tips for problems like yours. I recommend it with all my heart.
If you use STL, try STLPort and checked builds. Invalid iterators are hell.
Good luck. Problems like yours take us months to solve. Be ready for this...
We've had pretty good luck by writing our own malloc and free functions. In production, they just call the standard malloc and free, but in debug, they can do whatever you want. We also have a simple base class that does nothing but override the new and delete operators to use these functions; then any class you write can simply inherit from that class. If you have a ton of code, it may be a big job to replace calls to malloc and free with the new malloc and free (don't forget realloc!), but in the long run it's very helpful.
In Steve Maguire's book Writing Solid Code (highly recommended), there are examples of debug stuff that you can do in these routines, like:
Keep track of allocations to find leaks
Allocate more memory than necessary and put markers at the beginning and end of memory -- during the free routine, you can ensure these markers are still there
memset the memory with a marker on allocation (to find usage of uninitialized memory) and on free (to find usage of freed memory)
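For illustration, a rough sketch of how such guard markers and fill patterns might be implemented (the constants and layout here are illustrative, not taken from the book):

#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cassert>

// Layout of each block: [size][front guard][user data][rear guard].
static const std::size_t GUARD = 16;
static const unsigned char GUARD_BYTE = 0xFD;
static const unsigned char FRESH_BYTE = 0xCD;   // "uninitialized" fill
static const unsigned char DEAD_BYTE  = 0xDD;   // "freed" fill

void* debugMalloc(std::size_t n)
{
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(n + 2 * GUARD + sizeof(std::size_t)));
    if (!raw) return nullptr;
    std::memcpy(raw, &n, sizeof n);                             // remember the size
    std::memset(raw + sizeof n, GUARD_BYTE, GUARD);             // front guard
    std::memset(raw + sizeof n + GUARD, FRESH_BYTE, n);         // poison new memory
    std::memset(raw + sizeof n + GUARD + n, GUARD_BYTE, GUARD); // rear guard
    return raw + sizeof n + GUARD;
}

void debugFree(void* p)
{
    if (!p) return;
    unsigned char* user = static_cast<unsigned char*>(p);
    unsigned char* raw  = user - GUARD - sizeof(std::size_t);
    std::size_t n;
    std::memcpy(&n, raw, sizeof n);
    for (std::size_t i = 0; i < GUARD; ++i) {                   // check both guards
        assert(raw[sizeof n + i] == GUARD_BYTE);
        assert(user[n + i] == GUARD_BYTE);
    }
    std::memset(user, DEAD_BYTE, n);                            // poison freed memory
    std::free(raw);
}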
Another good idea is to never use things like strcpy, strcat, or sprintf -- always use strncpy, strncat, and snprintf. We've written our own versions of these as well, to make sure we don't write off the end of a buffer, and these have caught lots of problems too.
Run the original application with ADPlus -crash -pn appname.exe
When the memory issue pops up you will get a nice big dump.
You can analyze the dump to figure what memory location was corrupted.
If you are lucky, the overwritten memory is a unique string and you can figure out where it came from. If you are not lucky, you will need to dig into the win32 heap and figure out what the original memory characteristics were. (!heap -x might help)
After you know what was messed-up, you can narrow appverifier usage with special heap settings. i.e. you can specify what DLL you monitor, or what allocation size to monitor.
Hopefully this will speed up the monitoring enough to catch the culprit.
In my experience, I never needed full heap verifier mode, but I spent a lot of time analyzing the crash dump(s) and browsing sources.
P.S:
You can use DebugDiag to analyze the dumps.
It can point out the DLL owning the corrupted heap, and give you other useful details.
You should attack this problem with both runtime and static analysis.
For static analysis consider compiling with PREfast (cl.exe /analyze). It detects mismatched delete and delete[], buffer overruns and a host of other problems. Be prepared, though, to wade through many kilobytes of L6 warning, especially if your project still has L4 not fixed.
PREfast is available with Visual Studio Team System and, apparently, as part of Windows SDK.
Is this in low memory conditions? If so it might be that new is returning NULL rather than throwing std::bad_alloc. Older VC++ compilers didn't properly implement this. There is an article about Legacy memory allocation failures crashing STL apps built with VC6.
The apparent randomness of the memory corruption sounds very much like a thread synchronization issue - a bug that is reproduced depending on machine speed. If objects (chunks of memory) are shared among threads and synchronization primitives (critical section, mutex, semaphore, other) are not on a per-class (per-object) basis, then it is possible to come to a situation where a class (chunk of memory) is deleted / freed while in use, or used after being deleted / freed.
As a test for that, you could add synchronization primitives to each class and method. This will make your code slower because many objects will have to wait for each other, but if this eliminates the heap corruption, your heap-corruption problem will become a code optimization one.
You tried old builds, but is there a reason you can't keep going further back in the repository history and seeing exactly when the bug was introduced?
Otherwise, I would suggest adding simple logging of some kind to help track down the problem, though I am at a loss of what specifically you might want to log.
If you can find out what exactly CAN cause this problem, via google and documentation of the exceptions you are getting, maybe that will give further insight on what to look for in the code.
My first action would be as follows:
Build the binaries in "Release" version but creating debug info file (you will find this possibility in project settings).
Use Dr Watson as a defualt debugger (DrWtsn32 -I) on a machine on which you want to reproduce the problem.
Repdroduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
Another try might be using WinDebug as a debugging tool, which is quite powerful while also being lightweight.
Maybe these tools will allow you at least to narrow the problem to certain component.
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?
So from the limited information you have, this can be a combination of one or more things:
Bad heap usage, i.e., double frees, read after free, write after free, setting the HEAP_NO_SERIALIZE flag with allocs and frees from multiple threads on the same heap
Out of memory
Bad code (i.e., buffer overflows, buffer underflows, etc.)
"Timing" issues
If it's at all the first two but not the last, you should have caught it by now with pageheap.exe.
Which most likely means it is due to how the code is accessing shared memory. Unfortunately, tracking that down is going to be rather painful. Unsynchronized access to shared memory often manifests as weird "timing" issues. Things like not using acquire/release semantics for synchronizing access to shared memory with a flag, not using locks appropriately, etc.
At the very least, it would help to be able to track allocations somehow, as was suggested earlier. At least then you can view what actually happened up until the heap corruption and attempt to diagnose from that.
Also, if you can easily redirect allocations to multiple heaps, you might want to try that to see if it either fixes the problem or results in more reproducible buggy behavior.
When you were testing with VS2008, did you run with HeapVerifier with Conserve Memory set to Yes? That might reduce the performance impact of the heap allocator. (Plus, you have to run with it Debug->Start with Application Verifier, but you may already know that.)
You can also try debugging with Windbg and various uses of the !heap command.
MSN
Graeme's suggestion of custom malloc/free is a good idea. See if you can characterize some pattern about the corruption to give you a handle to leverage.
For example, if it is always in a block of the same size (say 64 bytes) then change your malloc/free pair to always allocate 64 byte chunks in their own page. When you free a 64 byte chunk then set the memory protection bits on that page to prevent reads and writes (using VirtualProtect). Then anyone attempting to access this memory will generate an exception rather than corrupting the heap.
This does assume that the number of outstanding 64 byte chunks is only moderate or you have a lot of memory to burn in the box!
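For illustration, a hedged Win32 sketch of that scheme for fixed 64-byte chunks (simplified: pages are never recycled, which is exactly what makes stale pointers trap):

#include <windows.h>

// Each 64-byte request gets its own page; on free the page is marked
// PAGE_NOACCESS, so any later read or write of the stale pointer raises an
// access violation immediately instead of silently corrupting the heap.
void* guardedAlloc64()
{
    // Reserve and commit one page (typically 4 KB) for a single 64-byte chunk.
    return VirtualAlloc(nullptr, 64, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}

void guardedFree(void* p)
{
    DWORD oldProtect = 0;
    // Keep the page mapped but unreadable/unwritable: use-after-free now traps.
    VirtualProtect(p, 64, PAGE_NOACCESS, &oldProtect);
    // Intentionally never released, so the address is never reused
    // (costs a lot of address space; acceptable for a debugging session).
}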
If you choose to rewrite new/delete, I have done this and have simple source code at:
http://gandolf.homelinux.org/~smhanov/blog/?id=10
This catches memory leaks and also inserts guard data before and after the memory block to capture heap corruption. You can just integrate with it by putting #include "debug.h" at the top of every CPP file, and defining DEBUG and DEBUG_MEM.
Some time ago I had to solve a similar problem.
If the problem still exists I suggest you do this :
Monitor all calls to new/delete and malloc/calloc/realloc/free.
I would make a single DLL exporting a function that registers all calls. This function receives parameters identifying your source code location, a pointer to the allocated area, and the type of call, and saves this information in a table.
Every matched allocate/free pair is eliminated. At the end (or whenever you need) you call another function to create a report of the remaining entries.
With this you can identify mismatched calls (new/free or malloc/delete) or missing deallocations.
If there is any case of a buffer being overwritten in your code, the saved information can be wrong, but each test may detect, discover or narrow down a failure. Run it many times to help identify the errors.
Good luck.
Do you think this is a race condition? Are multiple threads sharing one heap? Can you give each thread a private heap with HeapCreate, then they can run fast with HEAP_NO_SERIALIZE. Otherwise, a heap should be thread safe, if you're using the multi-threaded version of the system libraries.
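For illustration, a minimal Win32 sketch of the per-thread private heap idea (error checking omitted; safe only if the heap is never touched from another thread):

#include <windows.h>

DWORD WINAPI worker(LPVOID)
{
    HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);   // growable private heap

    void* block = HeapAlloc(heap, 0, 256);
    // ... use block, strictly within this thread ...
    HeapFree(heap, 0, block);

    HeapDestroy(heap);
    return 0;
}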
A couple of suggestions. You mention the copious warnings at W4 - I would suggest taking the time to fix your code to compile cleanly at warning level 4 - this will go a long way towards preventing subtle hard-to-find bugs.
Second - for the /analyze switch - it does indeed generate copious warnings. To use this switch in my own project, what I did was to create a new header file that used #pragma warning to turn off all the additional warnings generated by /analyze. Then further down in the file, I turn on only those warnings I care about. Then use the /FI compiler switch to force this header file to be included first in all your compilation units. This should allow you to use the /analyze switch while controlling the output.