This is an extension of my previous question, Application crash with no explanation.
I have a lot of crashes that are presumably caused by heap corruption on an application server. These crashes only occur in production; they cannot be reproduced in a test environment.
I'm looking for a way to track down these crashes.
Application Verifier was suggested, and it would be fine, but it's unusable on our production server. When we start the server in production under Application Verifier it becomes so slow that it's completely unusable, even though this is a fairly powerful machine (64-bit application, 16 GB of memory, 8 processors). Running without Application Verifier, it uses only about 1 GB of memory and no more than 10-15% of any processor's cycles.
Are there any other tools that will help find heap corruption, without adding a huge overhead?
Use the debug version of the Microsoft runtime libraries. Turn on red-zoning and get your heap automatically checked every 128 (say) heap operations by calling _CrtSetDbgFlag() once during initialisation.
_CRTDBG_DELAY_FREE_MEM_DF can be quite useful for finding use-after-free bugs, but your heap size grows monotonically while using it.
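A minimal sketch of what that initialisation might look like (the flag names are from crtdbg.h in the Microsoft debug CRT; the function name here is just illustrative):

#include <crtdbg.h>

// Call once during start-up, in a debug build linked against the debug CRT.
void EnableHeapChecking()
{
    int flags = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);  // read the current flags
    flags |= _CRTDBG_ALLOC_MEM_DF;        // use the debug allocator (guard bytes around each block)
    flags |= _CRTDBG_CHECK_EVERY_128_DF;  // validate the whole heap every 128 heap operations
    flags |= _CRTDBG_DELAY_FREE_MEM_DF;   // keep freed blocks around to help catch use-after-free
    _CrtSetDbgFlag(flags);
}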
Would there be any benefit in running it virtualized and taking scheduled snapshots, so that you can hopefully get a snapshot just a little before it actually crashes? Then take the pre-crash snapshot and start it in a lab environment. If you can get it to crash again there, restore the snapshot and start inspecting your server process.
Mudflap with GCC. It does code instrumentation and is usable on production builds.
You have to compile your software with -fmudflap. It will check any wrong pointer access (heap/stack/static). It is designed to work on production code with a modest slowdown (roughly 1.5x to 5x). You can also disable checking on read accesses for extra speed.
Related
I have a C++ program running on Windows 10. The Windows Task Manager tells me that the commit size is increasing rapidly over time, while the working set appears to be constant.
(Screenshot of Task Manager: my program in the first line, commit size ~37 GB.)
The code has been checked for memory leaks many times by different developers, we can't find any obvious leak.
The program is a graphics- and memory-intensive application that uses MFC to instantiate multiple windows, which we render into with OpenGL. There is a lot of copying of data going on at runtime, because we are processing images from multiple cameras.
The issue is that after ~10-15 days, when the commit size exhausts the total available memory of the system (page file included, not just physical RAM), either:
a) the program will crash to desktop
b) the display driver disconnects from the GPU and we are greeted with just black screens.
What I have tried so far:
finding memory leaks in the code
updating graphics driver
updating windows 10
What kind of leak could cause only the commit size to increase? How can I prevent this issue from happening?
After much testing and code review, I found that there was no memory leak in my application after all. Windows reported it as such; however, after upgrading the Windows 10 build from 1909 to 21H2, all the issues went away. No more commit-memory growth and no more crashing/black screens.
What, MFC is still around? Unless you have a real need for MFC you should probably use something like GLFW.
Past that, outside of external memory-leak detectors, you can use the heap debug support in Visual Studio's C runtime. You will need to define the following in your code:
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>
and add
_CrtDumpMemoryLeaks();
somewhere after the major processing and before your program exits.
This will swap out the normal malloc and free with debug versions. You can also override new/delete to get more information as well. These overrides should catch most of the standard library allocators. There is a lot of info in that link (more than should probably be repeated here).
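A minimal, self-contained sketch of the setup described above (the DBG_NEW macro follows the pattern from Microsoft's documentation; the deliberate leak is only there so the dump has something to report):

// Must come before the includes so malloc/free map to their debug versions
// that record file and line information.
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

// Optional: route operator new through the debug allocator as well, so leak
// reports for new'd blocks also carry file/line numbers.
#ifdef _DEBUG
#define DBG_NEW new ( _NORMAL_BLOCK , __FILE__ , __LINE__ )
#else
#define DBG_NEW new
#endif

int main()
{
    int* leaked = DBG_NEW int[100];   // deliberately never freed, for demonstration
    (void)leaked;

    _CrtDumpMemoryLeaks();            // dumps every block still allocated to the debug output
    return 0;
}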
After swapping out the memory handlers you would then run your program and, after it has leaked a bit, shut it down. Be careful to shut it down gracefully (don't kill the process), otherwise a lot more than the actual leaks will be reported as leaks. Once you call _CrtDumpMemoryLeaks it will dump information on everything that was not freed.

Sometimes the leaks can be hard to find, as the dump may not include the source file/line (especially if you do not override new/delete); it will just be an allocation index, size, and data information, but the docs tell you how to track it down with allocation breakpoints. By default the info is dumped to VS's output window, so in general it is better to run the app from VS. There are other apps that hook the OutputDebugString call, and you can also redirect it yourself, but again VS is the easiest.

You will need to run debug builds for any of this to work. The debug versions of the allocators are slower and increase the size of your allocations, but it's a small price to pay in this case (and it's temporary).
If you still have issues locating the leaks, the next thing to try is GFlags with PageHeap, but that is a whole other topic and things have really gone wrong by that point.
I'm using CentOS 7 and I'm running a C++ application. Recently I switched to a newer version of a library which the application uses for various MySQL C API functions. But after integrating the new library, I saw a tremendous increase in the program's memory usage: the application crashes if left running for more than a day or two. Precisely, what happens is that the application's memory usage keeps growing up to a point where the application alone is using 74.9% of the system's total memory, and then it is forcefully shut down by the system.
Is there any way to track the memory usage of the whole application, including static variables? I've already tried Valgrind's Massif tool.
Can anyone tell me the possible reasons for the increased memory usage, or suggest any tools that can give me deep insight into how the memory is being allocated (both static and dynamic)? Is there any tool that can report memory allocation for a C++ application running in a Linux environment?
Thanks in advance!
Static memory is allocated when the program starts. Are you seeing memory growth over time or just a larger footprint at startup?
Since it takes 'a day or two to crash', the trouble is likely a memory leak or unbounded growth of a data structure. Valgrind should be able to help with both. If Valgrind shows a big leak with the --leak-check=full option then you have likely found the issue.
To check for unbounded growth, put a preemptive _exit() in the program at a point where you suspect the heap has grown. For example, put a timer on the main loop and have the program call _exit() after 10 minutes. If the Valgrind run then shows a large 'in use at exit' figure, you likely have unbounded growth of a data structure rather than a leak. Massif can help track this down; its ms_print tool gives details of the allocations together with their call stacks.
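A rough sketch of that timer idea (the 10-minute cutoff and the loop body are just placeholders):

#include <chrono>
#include <unistd.h>   // _exit

int main()
{
    const auto start = std::chrono::steady_clock::now();

    for (;;)
    {
        // ... normal per-iteration work of the application ...

        if (std::chrono::steady_clock::now() - start > std::chrono::minutes(10))
            _exit(0);   // skip destructors/cleanup so live data shows up as 'in use at exit'
    }
}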
If you find an issue, try switching back to the older version of your library. If the problem goes away, check and make sure you are using the API properly in the new version. If you don't have the source code then you are a bit stuck in terms of a fix.
If you want to go the extra mile, you can write a shared library interposer for malloc/free to see what is happening. Here is a good start. Linux has the backtrace functionality that can help with determining the exact stack.
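A hypothetical sketch of such an interposer (the file and library names are illustrative; a production-grade version would also need to handle calloc/realloc and guard against dlsym re-entrancy):

// Build as a shared object and preload it, e.g.:
//   g++ -shared -fPIC -O2 -o libmemtrace.so memtrace.cpp -ldl
//   LD_PRELOAD=./libmemtrace.so ./your_app
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <unistd.h>
#include <stdio.h>

static void* (*real_malloc)(size_t) = nullptr;
static void  (*real_free)(void*)    = nullptr;

extern "C" void* malloc(size_t size)
{
    if (!real_malloc)
        real_malloc = reinterpret_cast<void* (*)(size_t)>(dlsym(RTLD_NEXT, "malloc"));
    void* p = real_malloc(size);

    // snprintf + write instead of fprintf, so logging does not allocate and
    // re-enter this function.
    char buf[64];
    int n = snprintf(buf, sizeof buf, "malloc(%zu) = %p\n", size, p);
    write(STDERR_FILENO, buf, n);
    return p;
}

extern "C" void free(void* p)
{
    if (!real_free)
        real_free = reinterpret_cast<void (*)(void*)>(dlsym(RTLD_NEXT, "free"));

    char buf[32];
    int n = snprintf(buf, sizeof buf, "free(%p)\n", p);
    write(STDERR_FILENO, buf, n);
    real_free(p);
}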
Finally, if you must use the 3rd-party library and find the heap growing without bound or leaking, then you can use the shared library interposer to directly call free/delete. This is a risky, last-ditch, unrecommended strategy, but I've used it in production to limp a process along.
I'm having an issue with my application that I only observe while I use Valgrind.
My program involves a large simulation. When I unload the simulation portion of the program while monitoring for errors with Valgrind, the application suffers a permanent slowdown. I would have expected the opposite, as unloading basically leaves my application with very little to do... Valgrind reports no errors. This slowdown does not occur (or is not observable) when I don't use Valgrind.
I have tried benchmarking various portions of my application using timers and they all seem to slow down fairly evenly across the board. My application also contains multiple asynchronous threads that all slow down. Processor usage does not seem to increase when viewed through the system monitor...
I'll note that I'm using OpenGL with the fglrx drivers, which are known to have some issues with Valgrind.
Should I be concerned about this even though it only occurs under Valgrind? Is it likely that this slowdown is caused by freeing a large amount of data while running under Valgrind, or is it more likely indicative of a serious bug in my code?
Basically I am trying to ascertain whether this is entirely an artifact of running under Valgrind, or whether Valgrind is amplifying the consequences of a bug in my code that is otherwise symptom-free (but may later cause me problems).
I have a C++ process running in Solaris which creates 3 threads to do some tasks.
These threads execute in loops and run as long as the process is running.
But, I see that the memory usage of the process grows continuously and the process core dumps once the memory usage exceeds 4GB.
Can someone give me some pointers on what could be the issue behind memory usage growth?
What can I do to prevent process from core dumping because of memory exhaustion?
Will thread restart help?
Any pointers welcome.
No, restarting a thread would not help.
It seems like you have a memory leak in your application.
In my experience there are two types of memory leaks:
real memory leaks that you can see when the application exits
'false' memory leaks, like a big list that increases during the lifetime of your application but which is correctly cleaned up at the end
For the first type, there are tools which can report the memory that has not been freed by your application when it exits. I don't know about Solaris but there are numerous tools under Windows which can do that. For Unix, I think that Valgrind does this.
For the second type, there are also tools under Windows that can take snapshots of the memory of your application. Simply take two snapshots with an interval of a few minutes or hours (depending on your application) and let the tool compare them. There are probably similar tools on Solaris.
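For what it's worth, on Windows the two-snapshot comparison can even be done programmatically with the debug CRT; something equivalent is what you would be looking for in a Solaris tool. A rough sketch:

#include <crtdbg.h>

void CompareHeapSnapshots()
{
    _CrtMemState before, after, diff;

    _CrtMemCheckpoint(&before);

    // ... let the application run its normal workload for a while ...

    _CrtMemCheckpoint(&after);
    if (_CrtMemDifference(&diff, &before, &after))
        _CrtMemDumpStatistics(&diff);   // reports how the heap changed between the two snapshots
}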
Using these tools will probably require your application to take much more memory, since the tool needs to store the call stack of every memory allocation. Because of this it will also run much slower. However, you will only see this effect while you are actively using the tool, so there is no effect on real-life production code.
So, just look for these kinds of tools under Solaris. I quickly Googled for it and found this link: http://prefetch.net/blog/index.php/2006/02/19/finding-memory-leaks-on-solaris-systems/. This could be a starting point.
EDIT: Some additional information: are you looking at the right kind of memory? Even if you only allocated 3GB in total, the total virtual address space may still reach 4GB because of memory fragmentation. Unfortunately, there is nothing you can do about this (except using another memory allocation strategy).
Guys, could you please recommend a tool for spotting memory corruption on a production multithreaded server built with C++ and running under Linux x86_64? I'm currently facing the following problem: every several hours my server crashes with a segfault, and the core dump shows that the error happens in malloc/calloc, which is definitely a sign of memory being corrupted somewhere.
Actually I have already tried some tools without much luck. Here is my experience so far:
Valgrind is a great (I'd even say the best) tool, but it slows the server down too much, making it unusable in production. I tried it on a stage server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my stage server under Valgrind for several hours but still couldn't spot any serious errors.
ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the stage server, in random weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't support threading well?.. I have no idea.
DUMA - same story as ElectricFence but even worse. While EF produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is built with the -g flag for sure).
dmalloc - I configured the server to use it instead of the standard malloc routines, however it hangs after several minutes. Attaching gdb to the process reveals it is hung somewhere inside dmalloc :(
I'm gradually going crazy and simply don't know what to do next. I still have the following tools to try: mtrace and mpatrol, but maybe someone has a better idea?
I'd greatly appreciate any help on this issue.
Update: I managed to find the source of the bug. However, I found it on the stage server, not the production one, using helgrind/DRD/tsan - there was a data race between several threads which resulted in memory corruption. The key was to use proper valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without any significant slowdown...
Yes, C/C++ memory corruption problems are tough.
I have also used valgrind several times; sometimes it revealed the problem and sometimes not.
While examining valgrind's output, don't be too quick to dismiss its results. Sometimes, after considerable time spent, you'll realise that valgrind gave you the clue in the first place, but you ignored it.
Another piece of advice is to compare the code changes against the previously known stable release. This is easy if you use some sort of source versioning system (e.g. svn). Examine all memory-related functions (e.g. memcpy, memset, sprintf, new, delete/delete[]).
Compile your program with gcc 4.1 and the -fstack-protector-all switch. If the memory corruption is caused by stack smashing this should be able to detect it. You might need to play with some of the additional parameters of SSP.
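As a toy illustration of the kind of bug SSP catches (the compile line in the comment is just an example):

// Compile with something like:  g++ -fstack-protector-all smash.cpp
// With the protector enabled, the process aborts with
// "*** stack smashing detected ***" instead of silently corrupting the stack.
#include <cstring>

void overflow(const char* input)
{
    char buf[8];
    std::strcpy(buf, input);   // no bounds check: a long input overruns buf
}

int main()
{
    overflow("this string is far longer than eight bytes");
    return 0;
}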
Folks, I managed to find the source of the bug. However, I found it on the stage server using helgrind/DRD/tsan - there was a data race between several threads which resulted in memory corruption. The key was to use proper valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without any significant slowdown...
Have you tried -fmudflap? (See the mudflap answer earlier for the options available.)
You can try IBM Purify, but I'm afraid it is not open source.
Google Perftools, which is open source, may be of help; see the heap checker documentation.
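A minimal sketch of the heap checker's explicit (partial-program) mode, assuming the gperftools headers are installed and the program is linked against tcmalloc; whole-program checking is driven by the HEAPCHECK environment variable instead:

#include <gperftools/heap-checker.h>
#include <cassert>

void CheckSuspectCode()
{
    HeapLeakChecker checker("suspect_workload");
    {
        // ... run the code you suspect of leaking ...
    }
    assert(checker.NoLeaks() && "heap memory leak detected");
}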
Try this one:
http://www.hexco.de/rmdebug/
I used it extensively; it has a low performance impact (it mostly increases RAM usage) and the allocation algorithm stays the same. It has always proven enough to find any allocation bug. Your program will crash as soon as the bug occurs, and you will have a detailed log.
I'm not sure if it would have caught your particular bug, but the MALLOC_CHECK_ environment variable (malloc man page) turns on additional checking in the default Linux malloc implementation, and typically doesn't have a significant runtime cost.
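For illustration, here is a toy double free of the kind MALLOC_CHECK_ will flag (the invocation in the comment is just an example):

// Run as, for example:  MALLOC_CHECK_=3 ./a.out
// With MALLOC_CHECK_=3, glibc prints a diagnostic and aborts on the bad free
// instead of (possibly) corrupting the heap silently.
#include <cstdlib>

int main()
{
    char* p = static_cast<char*>(std::malloc(16));
    std::free(p);
    std::free(p);   // double free: reported when MALLOC_CHECK_ is set
    return 0;
}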