Find memory leaks when Valgrind shows nothing - C++

I am looking for memory leaks in a C++ program on Linux with a heavy legacy background (multi-threaded, using libstdc++ containers). This program is a proxy server, an intermediary for requests from clients to servers.
Valgrind has detected a few leaks that are now fixed, and it now shows nothing more.
But the RSS of the process (resident memory as shown by /proc//stat) still grows on a given repeated stimulus (around 9 bytes per iteration). The growth is not linear and happens in big steps, probably because the libstdc++ containers do their own memory optimization and because RSS is measured in pages of 4096 bytes.
As Valgrind finds nothing, I suspect either recursive calls that grow the stack, or some unused and forgotten containers (e.g. std::list, std::map, std::string, etc.) that keep growing.
The only methods I see for my search are:
Reading the code;
Reducing the scope by deactivating parts of the code.
But these are laborious and time consuming.
How could I improve my search? Are there tools for finding growing stacks or containers?
Any other ideas about the cause of the leak (besides dangling pointers, uncontrolled recursion, and growing containers)?

Use https://github.com/vmware/chap :
To do this, gather a live core of your process (after running an hour's worth of iterations) then start chap with the path of that core as the only argument. From the chap prompt, try the following:
count used
count free
count leaked
count writable
Assuming the number reported for used is significantly larger than the number reported for free, the next thing you want to check is the number for leaked. If that number is non-zero, you actually have leaks in the sense of memory that is no longer referenced. Follow USERGUIDE.md for some strategies for analyzing that.
If the number for leaked is 0 or insignificant, but the number for used is large or still growing, you likely have some container growth. Use summarize used as the next step.

Related

How to iterate all malloc chunks (glibc)

I'm trying to iterate over all the malloc_chunks in all arenas (debugging based on a core file, for memory leak and memory corruption investigation).
As I understand it, each arena has a top_chunk member that points to the top chunk of that arena. Each chunk carries prev_size and size fields, as described in the code (glibc/malloc/malloc.c).
Based on those fields I can step to the previous contiguous chunk and then loop over all the chunks in one arena. (I can gather statistics on the chunks by size and count, much like WinDbg's !heap -stat -h, and, based on prev_size and size, check whether a chunk is corrupt or not.)
In the arena (malloc_state) there is a member variable next which points to the next arena, so I can loop over every arena's chunks.
But I hit a problem: if the neighbouring chunk is still allocated, prev_size is invalid, so how do I get to the previous malloc_chunk? Or is this approach simply not correct?
Question Background:
The bug we have is a memory leak reported on several online data nodes (our project is a distributed storage cluster).
What we did and the results:
We used Valgrind to reproduce the bug in a test cluster, but unfortunately got nothing.
I tried to investigate the heap further, analyzing the heap chunks the way I used to in WinDbg (which has very useful heap commands for digging into memory leaks and memory corruption), but I was blocked by the question I asked above.
We used Valgrind's massif tool to analyze the allocations (which I think is very detailed and interesting; it can show which allocation takes how much memory). Massif gave several clues; we followed them, checked the code, and finally found the leak (a map had grown huge through improper usage, but it was erased in its holder class's destructor, which is why Valgrind did not report it).
I'll dig further into the gdb-heap source code to learn more about the glibc malloc structures.
The free open source program https://github.com/vmware/chap does what you want here for glibc malloc. Just grab a core (either because the process crashed, or grab a live core using gcore or the generate-core-file command from within gdb). Then just open the core by doing:
chap yourCoreFileName
Once you get to the chap prompt, if you want to iterate through all the chunks, both free and not, you can do any of the following, depending on the verbosity you want, but keeping in mind that an "allocation" in chap does not contain the chunk header, but rather starts at the address returned by malloc.
Try any of the following:
count allocations
summarize allocations
describe allocations
show allocations
If you only care about allocations that are currently in use try any of the following:
count used
summarize used
describe used
show used
If you only care about allocations that are leaked try any of the following:
count leaked
summarize leaked
describe leaked
show leaked
More details are available in documentation available from the github URL mentioned above.
In terms of corruption, chap does some checking at startup and reports many kinds of corruption, although the output may be a bit cryptic at times.
First, before digging into the implementation details of malloc, your time may be better spent with a tool like valgrind, or even running under the MALLOC_CHECK_ environment variable to let the internal heap consistency checking do the work for you.
But, since you asked....
glibc's malloc.c has some helpful comments about looking at the previous chunk.
Some particularly interesting ones are:
/* Note that we cannot even look at prev unless it is not inuse */
And:
If prev_inuse is set for any given chunk, then you CANNOT determine the size of the previous chunk, and might even get a memory addressing fault when trying to do so.
This is just a limitation of the malloc implementation. When a previous chunk is in use, the footer that would store the size is used for the user data of the allocation instead.
While it doesn't help your case, you can check whether a previous chunk is in use by following what the prev_inuse macro does.
#define PREV_INUSE 0x1
#define prev_inuse(p) ((p)->size & PREV_INUSE)
It checks the low-order bit of the current chunk's size. (Chunk sizes are always multiples of at least 8, so the low-order bits are free to hold status flags.) That would help you stop your iteration before going off into no-man's land.
Unfortunately, you'd still be terminating your loop early, before visiting every chunk.
If you really want to iterate over all chunks, I'd recommend that you start at malloc_state::top and follow the next_chunk until next_chunk points to the top.
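For illustration, here is a rough sketch of that forward walk in C++. It assumes the classic two-word chunk header (prev_size, size) from glibc's malloc.c; field names, alignment, and flag bits differ between glibc versions, so treat it as a starting point rather than a drop-in tool:
#include <cstddef>
#include <cstdio>

// Chunk header layout assumed from glibc's malloc.c; real builds may differ.
struct chunk_header {
    std::size_t prev_size;  // only meaningful if the previous chunk is free
    std::size_t size;       // low bits hold PREV_INUSE / IS_MMAPPED / NON_MAIN_ARENA
};

const std::size_t SIZE_MASK = ~static_cast<std::size_t>(0x7);
const std::size_t PREV_INUSE_BIT = 0x1;

// Walk forward from the first chunk of a heap region up to (not including) top.
void walk_chunks(const char* first, const char* top) {
    for (const char* p = first; p < top; ) {
        const chunk_header* c = reinterpret_cast<const chunk_header*>(p);
        std::size_t sz = c->size & SIZE_MASK;
        if (sz == 0)
            break;  // corrupt chunk or end of the region
        std::printf("chunk at %p, size %zu, prev_inuse=%d\n",
                    static_cast<const void*>(p), sz,
                    static_cast<int>(c->size & PREV_INUSE_BIT));
        p += sz;  // chunks are contiguous, so the size leads to the next header
    }
}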
Try the pmap -XX <PID> command to break down the memory usage from several different angles.

Memory leak that doesn't crash when OOM, or show up in massif/valgrind

I have an internal C++ application that grows indefinitely--so much so that we've had to implement logic that kills it once its RSS reaches a certain peak size (2.0G), just to maintain some semblance of order. However, this has shown some strange behaviors.
First, I ran the application through Valgrind w/ memcheck, and fixed some random memory leaks here and there. However, the extent of these memory leaks was measured in the tens of megabytes. This makes sense, as it could be that there's no actual memory leaking--it could just be poor memory management on the application side.
Next, I used Valgrind w/ massif to check to see where the memory is going, and this is where it gets strange. The peak snapshot is 161M--nowhere near the 1.9G+ peaks we see using the RSS field. The largest consumption is where I'd expect--in std::string--but this is not abnormal.
Finally, and this is the most puzzling part--before we were aware of this memory leak, I actually was testing this service on AWS and, just for fun, set the number of workers to a high number on a CC2.8XL machine: 44 workers. That's 60.5G of RAM, and no swap. Fast forward a month: I go to look at the host--and lo and behold, it's maxed out on RAM--BUT! The processes are still running fine, and are stuck at varying stages of memory usage--almost evenly distributed from 800M to 1.9G. Every once in a while dmesg prints out a Xen error about being unable to allocate memory, but other than that, the processes never die and continue to actively process (i.e., they're not "stuck").
Is there something I'm missing here? It's basically working, but for the life of me, I can't figure out why. What would be a good recommendation on what to look for next? Are there any tools that might help me figure it out?
Note that valgrind memcheck only discovers memory that you "abandon": while(1) vec.push_back(n++); will fill all available memory but not report any leaks. By the sound of things, you are collecting strings somewhere and they take up a lot of space. I have also worked on code that uses a lot of memory without really leaking it [it's all in various places that valgrind is happy are not leaks!]. Sometimes you can track it down simply by adding some markers to the memory allocations, or some such, to indicate WHERE you are allocating memory.
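As a minimal illustration of that point (the loop bound is arbitrary, just to keep the program finite), memcheck reports nothing lost for the following program even though its RSS climbs steadily, because the vector still references everything it allocated:
#include <vector>

int main() {
    std::vector<int> vec;
    for (int n = 0; n < 50000000; ++n)   // bounded only to keep the example finite
        vec.push_back(n);                // RSS climbs steadily the whole time
    return 0;                            // the vector is destroyed here, so memcheck sees no leak
}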
In the std:: containers there is typically an Allocator template parameter. If you implement several different pools of memory, you may find where you are allocating memory.
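For example, a minimal counting allocator along these lines (the names CountingAllocator and g_tracked_bytes are made up here; this is a sketch, not a production allocator) lets you attribute growth to one suspect container:
#include <atomic>
#include <cstddef>
#include <map>
#include <new>
#include <string>

std::atomic<std::size_t> g_tracked_bytes{0};  // bytes currently held by tracked containers

template <class T>
struct CountingAllocator {
    using value_type = T;
    CountingAllocator() = default;
    template <class U> CountingAllocator(const CountingAllocator<U>&) {}

    T* allocate(std::size_t n) {
        g_tracked_bytes += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) {
        g_tracked_bytes -= n * sizeof(T);
        ::operator delete(p);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

// Plug the allocator into one suspect container and log g_tracked_bytes periodically.
using TrackedMap = std::map<int, std::string, std::less<int>,
                            CountingAllocator<std::pair<const int, std::string>>>;
Swap TrackedMap in for the suspect std::map and log g_tracked_bytes alongside RSS; if the two climb together, you have found your container.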
I have also seen cases where I think the process's memory is getting fragmented, so there are lots of little free spaces in the heap - this can happen if, for example, you build a lot of strings by repeatedly growing them.
If it's an issue of fragmentation, running valgrind massif with the --pages-as-heap=yes option may confirm whether fragmentation is what you have.

How to hunt down a memory leak valgrind says doesn't exist?

I have a program that accepts data from a socket, does some quality control and assorted other conditioning to it, then writes it out to a named pipe. I ran valgrind on it and fixed all the memory leaks that originally existed. I then created a 'demo' environment on a system where I had 32 instances of this program running, each being fed unique data and each outputting to its own pipe. We tested it and everything looked to be fine. Then I tried stress testing it by boosting the rate at which data is sent in to an absurd level, and things looked to be good at first...but my programs kept consuming more and more memory until I had no resources left.
I turned to valgrind and ran the exact same setup, except with each program running inside valgrind using --leak-check=full. A few odd things happened. First, the memory did leak, but only to the point where each program had consumed .9% of my memory (previously the largest memory hog had a full 6% of my memory). With valgrind running, the CPU cost of the programs shot up and I was now at 100% CPU with a huge load average, so it's possible the lack of available CPU made the programs all run slowly enough that the leak took too long to manifest. When I tried stopping these programs, valgrind showed no direct memory leaks; it showed some potential memory leaks, but I checked them and I don't think any of them represent real leaks, and besides, the potential leaks only amounted to a few kilobytes while the program was consuming over 100 MB. The reachable (non-leaked) memory reported by valgrind was also in the KB range, so valgrind seems to believe that my programs are consuming a fraction of the memory that top says they are using.
I've run a few other tests and got odd results. A single program, even running at triple the rate at which my original memory leak was detected, never seems to consume more than .9% memory; two programs leak up to 1.9% and 1.3% of memory respectively, but no more; and so on. It's as if the amount of memory leaked, and the rate at which it leaks, somehow depends on how many instances of my program are running at one time, which makes no sense; each instance should be 100% independent of the others.
I also found that if I run 32 instances with only one of them inside valgrind, the valgrinded instance (that's a word if I say it is!) leaks memory, but at a slower rate than the ones running outside of valgrind. The valgrind instance still says I have no direct leaks and reports far less memory consumption than top shows.
I'm rather stumped as to what could be causing this result, and why valgrind refuses to be aware of the memory leak. I thought it might be an outside library, but I don't really use any external libraries, just basic C++ functions/objects. I also considered that data might be written to the output pipe too fast, causing the buffer to grow indefinitely, but 1) there should be an upper limit on how much such a buffer can grow, and 2) once memory has been leaked, if I drop the data input rate to nothing, the memory stays consumed rather than slowly dropping back to a reasonable amount.
Can anyone give me a hint as to where I should look from here? I'm totally stumped as to why the memory is behaving this way.
Thanks.
This sounds like a problem I had recently.
If your program accepts data and buffers it internally without any limits, then it may be reading and buffering faster than it can output the data. In that case, memory use will continue to increase without limit.
The more instances of the program that you run, the slower each instance will go, and the faster the buffers will increase.
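If that does turn out to be the cause, bounding the internal buffer makes the producer block instead of letting memory grow without limit. A rough sketch (the class and its names are invented here, assuming C++11 threads):
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>

// When the queue is full, push() blocks the producer instead of letting memory grow.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t max_items) : max_items_(max_items) {}

    void push(std::string item) {
        std::unique_lock<std::mutex> lock(mu_);
        not_full_.wait(lock, [this] { return items_.size() < max_items_; });
        items_.push_back(std::move(item));
        not_empty_.notify_one();
    }

    std::string pop() {
        std::unique_lock<std::mutex> lock(mu_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        std::string item = std::move(items_.front());
        items_.pop_front();
        not_full_.notify_one();
        return item;
    }

private:
    std::mutex mu_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
    std::deque<std::string> items_;
    std::size_t max_items_;
};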
This may or may not be your problem, but without more information it is the best I can do.
You should first look for a soft leak. It happens when some static or singleton object gradually grows a buffer or container and collects trash in it. Technically it is not a leak, but its effects are just as bad.
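A minimal sketch of such a soft leak (the cache and helper function are made up for illustration): every lookup of a new key grows the static map, nothing ever becomes unreachable, so Valgrind stays silent while RSS keeps climbing:
#include <map>
#include <string>

// Stand-in for some expensive computation.
static std::string expensive_compute(const std::string& key) {
    return key + "-value";
}

std::string lookup(const std::string& key) {
    static std::map<std::string, std::string> cache;  // lives for the whole process, never pruned
    auto it = cache.find(key);
    if (it == cache.end())
        it = cache.emplace(key, expensive_compute(key)).first;
    return it->second;
}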
May I suggest you give MemoryScape a try? This tool does a pretty good job at memory leak detection. It's not free, but given the time and energy spent, it is worth trying.

How does a memory leak improve performance

I'm building a large RTree (spatial index) full of nodes. It needs to be able to handle many queries AND updates. Objects are continuously being created and destroyed. The basic test I'm running is to see the performance of the tree as the number of objects in the tree increases. I insert from 100 to 20000 uniformly sized, randomly located objects, in increments of 100. Searching and updating are irrelevant to the issue I am currently facing.
Now, when there is NO memory leak, the "insert into tree" performance is all over the place. It goes anywhere from 10.5 seconds with ~15000 objects to 1.5 seconds with ~18000. There is no pattern whatsoever.
When I deliberately add a leak, as simple as putting in "new int;" (I don't assign it to anything; that right there is a line to itself), the performance instantly falls onto a nice gentle curve, sloping from roughly 0 seconds for 100 objects to 1.5 seconds for the full 20k.
Very, very lost at this point. If you want source code I can include it but it's huuugggeeee and literally the only line that makes a difference is "new int;"
Thanks in advance!
-nick
I'm not sure how you came up with this new int test, but it's not a very good way to fix things :) Run your code using a profiler and find out where the real delays are. Then concentrate on fixing the hot spots.
g++ has it built in - just compile with -pg
Without more information it's impossible to be sure.
However, I wonder if this has to do with heap fragmentation. By creating and freeing many blocks of memory, you'll likely be creating a whole load of small memory fragments linked together. The memory manager needs to keep track of them all so it can allocate them again if needed.
When you free a block, some memory managers try to "merge" it with the surrounding blocks of memory, and on a highly fragmented heap this can be very slow as they search for those surrounding blocks. Not only that, but if you have limited physical memory, following the chain of memory blocks can "touch" many physical pages, which can cause a whole load of extremely slow page faults whose cost varies greatly depending on exactly how much physical memory the OS decides to give that process.
By leaving some memory un-freed you will be changing this pattern of access, which might make a large difference to the speed. You might, for example, be forcing the runtime library to allocate a new block of memory each time rather than having to track down a suitably sized existing block to reuse.
I have no evidence that this is the case in your program, but I do know that memory fragmentation is often the cause of slow programs when a lot of allocating and freeing is performed.
A possible explanation (a theory):
The compiler did not remove the unused new int;
The new int sits in one of the inner loops, or somewhere in your recursive traversal, where it gets executed very often;
The overall RSS of the process increases, and eventually so does the total memory used by the process;
Page faults happen because of this;
Because of the page faults, the process becomes I/O-bound instead of CPU-bound.
End result: you see a drop in throughput. It would help if you could mention the compiler being used and the options you are using to build the code.
I am taking a stab in the dark here, but the problem could be the way the heap gets fragmented. You said that you are creating and destroying large numbers of objects. I will assume that the objects are all of different sizes.
When you allocate memory on the heap, a cell of the needed size is broken off from the heap. When the memory is freed, the cell is added to a free list. On the next allocation, the allocator walks the free list until a cell that is big enough is found. With large numbers of allocations the free list can get rather long, and walking it can take a non-trivial amount of time.
Now, an int is rather small. So when you do your new int, it may well eat up all the small heap cells on the free list and thus dramatically speed up the larger allocations.
The chances are, however, that you are allocating and freeing similarly sized objects. If you use your own free lists, you will save yourself many heap walks and may dramatically improve performance. This is exactly what the STL allocators do to improve performance.
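A minimal sketch of that idea for fixed-size nodes (NodePool and Node are invented names; a real R-tree node would carry its own payload):
#include <cstddef>

struct Node {
    double x = 0, y = 0;
    Node* next_free = nullptr;  // reused as the free-list link while the node is recycled
};

// Hands out nodes from its own free list and only falls back to new when it is empty.
class NodePool {
public:
    Node* acquire() {
        if (free_list_ != nullptr) {
            Node* n = free_list_;
            free_list_ = n->next_free;
            return n;
        }
        return new Node();
    }

    void release(Node* n) {
        n->next_free = free_list_;
        free_list_ = n;
    }

    ~NodePool() {
        while (free_list_ != nullptr) {
            Node* n = free_list_;
            free_list_ = n->next_free;
            delete n;
        }
    }

private:
    Node* free_list_ = nullptr;
};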
Solution: do not run from within Visual Studio; actually run the .exe file directly. I figured this out because that's what the profilers were doing, and the numbers were magically dropping. I checked memory usage, and the version run this way (which gave me EXCEPTIONAL times) was not blowing up to excessively huge sizes.
Solution to why the hell Visual Studio does ridiculous crap like this: No clue.

Information about PTE's (Page Table Entries) in Windows

In order to find buffer overflows more easily, I am changing our custom memory allocator so that it allocates a full 4 KB page instead of only the wanted number of bytes. Then I change the page protection and placement so that if the caller writes before or after its allocated piece of memory, the application immediately crashes.
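For context, a rough sketch of that guard-page idea on Windows using VirtualAlloc might look like the following; the function name is made up, error handling is minimal, and only the "write after the block" case is covered:
#include <windows.h>
#include <cstddef>

// Place the allocation so that its end touches a reserved-but-uncommitted page;
// writing past the end then faults immediately. Alignment handling, the matching
// free routine, and the "write before the block" case are left out.
void* guarded_alloc(std::size_t bytes) {
    const std::size_t page = 4096;
    const std::size_t data_pages = (bytes + page - 1) / page;

    // Reserve the data pages plus one trailing guard page that is never committed.
    char* base = static_cast<char*>(
        VirtualAlloc(nullptr, (data_pages + 1) * page, MEM_RESERVE, PAGE_NOACCESS));
    if (base == nullptr)
        return nullptr;

    // Commit only the data pages; the final page stays inaccessible.
    if (VirtualAlloc(base, data_pages * page, MEM_COMMIT, PAGE_READWRITE) == nullptr)
        return nullptr;

    // The caller's block ends exactly at the guard page.
    return base + data_pages * page - bytes;
}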
The problem is that although I have enough memory, the application never starts up completely because it runs out of memory. This has two causes:
Since every allocation needs at least 4 KB, we probably reach the 2 GB limit very soon. This problem could be solved by building a 64-bit executable (I haven't tried it yet).
Even when I only need a few hundred megabytes, the allocations fail at a certain point.
The second problem is the biggest one, and I think it's related to the maximum number of PTE's (page table entries, which store information on how Virtual Memory is mapped to physical memory, and whether pages should be read-only or not) you can have in a process.
My questions (or a cry-for-tips):
Where can I find information about the maximum number of PTE's in a process?
Is this different (higher) for 64-bit systems/applications or not?
Can the number of PTE's be configured in the application or in Windows?
Thanks,
Patrick
PS: a note for those who will try to argue that you shouldn't write your own memory manager:
My application is rather specific so I really want full control over memory management (can't give any more details)
Last week we had a memory overwrite which we couldn't find using the standard C++ allocator and the debugging functionality of the C/C++ runtime (it only said "block corrupt" minutes after the actual corruption).
We also tried standard Windows utilities (like GFLAGS, ...) but they slowed down the application by a factor of 100, and couldn't find the exact position of the overwrite either
We also tried the "Full Page Heap" functionality of Application Verifier, but then the application doesn't start up either (probably also running out of PTE's)
There is what I thought was a great series of blog posts by Mark Russinovich on TechNet called "Pushing the Limits of Windows...":
http://blogs.technet.com/markrussinovich/archive/2008/07/21/3092070.aspx
It has a few articles on virtual memory, paged and nonpaged memory, physical memory, and others.
He mentions little utilities he uses to take measurements of a system's resources.
Hopefully you will find your answers there.
A shotgun approach is to allocate those isolated 4KB entries at random. This means that you will need to rerun the same tests, with the same input repeatedly. Sometimes it will catch the error, if you're lucky.
A slightly smarter approach is to use another algorithm than just random - e.g. make it dependent on the call stack whether an allocation is isolated. Do you trust std::string users, for instance, and suspect raw malloc use?
Take a look at the implementation of OpenBSD malloc. Much of the same ideas (and more) implemented by very skilled folk.
In order to find more easily buffer overflows I am changing our custom memory allocator so that it allocates a full 4KB page instead of only the wanted number of bytes.
This has already been done. Application Verifier with PageHeap.
Info on PTEs and the Memory architecture can be found in Windows Internals, 5th Ed. and the Intel Manuals.
Is this different (higher) for 64-bit systems/applications or not?
Of course. 64-bit Windows has a much larger address space, so clearly more PTEs are needed to map it.
Where can I find information about the maximum number of PTE's in a process?
This is not as important as the maximum amount of user address space available in a process. (The number of PTEs is that number divided by the page size; for example, mapping a full 2 GB of user address space with 4 KB pages takes 2 GB / 4 KB = 524,288 PTEs.)
That user address space is 2 GB on 32-bit Windows and much bigger on x64 Windows. (The actual number varies, but it's "big enough".)
Problem is that although I have enough memory, the application never starts up completely because it runs out of memory.
Are you a) leaking memory? b) using horribly inefficient algorithms?