CString use coupled with HeapWalk and HeapLock/HeapUnlock deadlocks in the kernel - c++

My goal is to lock virtual memory allocated for my process heaps (to prevent a possibility of it being swapped out to disk.)
I use the following code:
//pseudo-code, error checks are omitted for brevity
struct MEM_PAGE_TO_LOCK{
    const BYTE* pBaseAddr;   //Base address of the page
    size_t szcbBlockSz;      //Size of the block in bytes

    MEM_PAGE_TO_LOCK()
        : pBaseAddr(NULL)
        , szcbBlockSz(0)
    {
    }
};

void WorkerThread(LPVOID pVoid)
{
    //Called repeatedly from a worker thread
    HANDLE hHeaps[256] = {0};   //Assume large array for the sake of this example
    UINT nNumberHeaps = ::GetProcessHeaps(256, hHeaps);
    if(nNumberHeaps > 256)
        nNumberHeaps = 256;

    std::vector<MEM_PAGE_TO_LOCK> arrPages;
    for(UINT i = 0; i < nNumberHeaps; i++)
    {
        lockUnlockHeapAndWalkIt(hHeaps[i], arrPages);
    }

    //Now lock collected virtual memory
    for(size_t p = 0; p < arrPages.size(); p++)
    {
        ::VirtualLock((void*)arrPages[p].pBaseAddr, arrPages[p].szcbBlockSz);
    }
}
void lockUnlockHeapAndWalkIt(HANDLE hHeap, std::vector<MEM_PAGE_TO_LOCK>& arrPages)
{
    if(::HeapLock(hHeap))
    {
        __try
        {
            walkHeapAndCollectVMPages(hHeap, arrPages);
        }
        __finally
        {
            ::HeapUnlock(hHeap);
        }
    }
}
void walkHeapAndCollectVMPages(HANDLE hHeap, std::vector<MEM_PAGE_TO_LOCK>& arrPages)
{
    PROCESS_HEAP_ENTRY phe = {0};
    MEM_PAGE_TO_LOCK mptl;
    SYSTEM_INFO si = {0};
    ::GetSystemInfo(&si);

    for(;;)
    {
        //Get next heap block
        if(!::HeapWalk(hHeap, &phe))
        {
            if(::GetLastError() != ERROR_NO_MORE_ITEMS)
            {
                //Some other error
                ASSERT(NULL);
            }
            break;
        }

        //We need to skip heap regions & uncommitted areas
        //We're interested only in allocated blocks
        if((phe.wFlags & (PROCESS_HEAP_REGION |
            PROCESS_HEAP_UNCOMMITTED_RANGE | PROCESS_HEAP_ENTRY_BUSY)) == PROCESS_HEAP_ENTRY_BUSY)
        {
            if(phe.cbData &&
                phe.lpData)
            {
                //Get address aligned at the page size boundary
                size_t nRmndr = (size_t)phe.lpData % si.dwPageSize;
                BYTE* pBegin = (BYTE*)((size_t)phe.lpData - nRmndr);

                //Get segment size, also page aligned (round it up though)
                BYTE* pLast = (BYTE*)phe.lpData + phe.cbData;
                nRmndr = (size_t)pLast % si.dwPageSize;
                if(nRmndr)
                    pLast += si.dwPageSize - nRmndr;

                size_t szcbSz = pLast - pBegin;

                //Do we have such a block already, or an adjacent one?
                std::vector<MEM_PAGE_TO_LOCK>::iterator itr = arrPages.begin();
                for(; itr != arrPages.end(); ++itr)
                {
                    const BYTE* pLPtr = itr->pBaseAddr + itr->szcbBlockSz;

                    //See if they intersect or are adjacent
                    if(pLPtr >= pBegin &&
                        itr->pBaseAddr <= pLast)
                    {
                        //Intersected with another memory block
                        //Get the larger of the two
                        if(pBegin < itr->pBaseAddr)
                            itr->pBaseAddr = pBegin;

                        itr->szcbBlockSz = pLPtr > pLast ? pLPtr - itr->pBaseAddr : pLast - itr->pBaseAddr;
                        break;
                    }
                }

                if(itr == arrPages.end())
                {
                    //Add new page
                    mptl.pBaseAddr = pBegin;
                    mptl.szcbBlockSz = szcbSz;
                    arrPages.push_back(mptl);
                }
            }
        }
    }
}
This method works, except that rarely the following happens: the app hangs, UI and everything, and even if I run it under the Visual Studio debugger and try Break All, it shows an error message that no user-mode threads are running:
The process appears to be deadlocked (or is not running any user-mode
code). All threads have been stopped.
I tried it several times. The second time the app hung, I used Task Manager to create a dump file, then loaded the .dmp file into Visual Studio and analyzed it. The debugger showed that the deadlock happened somewhere in the kernel, and the call stack points to this location in my code:
CString str;
str.Format(L"Some formatting value=%d, %s", value, etc);
Experimenting further, if I remove the HeapLock and HeapUnlock calls from the code above, it doesn't seem to hang anymore. But then HeapWalk may sometimes raise an unhandled exception, an access violation.
So any suggestions how to resolve this?

The problem is that you're using the C runtime's memory management, and more specifically the CRT's debug heap, while holding the operating system's heap lock.
The call stack you've posted includes _free_dbg, which always claims the CRT debug heap lock before taking any other action, so we know the thread holds the CRT debug heap lock. We can also see that the CRT was inside an operating system call made by _CrtIsValidHeapPointer when the deadlock occurred; the only such call is to HeapValidate and HEAP_NO_SERIALIZE is not specified.
So the thread whose call stack has been posted is holding the CRT debug heap lock and attempting to claim the operating system's heap lock.
The worker thread, on the other hand, holds the operating system's heap lock and makes calls that attempt to claim the CRT debug heap lock.
QED. Classic deadlock situation.
In a debug build, you will need to refrain from using any C or C++ library functions that might allocate or free memory while you are holding the corresponding operating system heap lock.
Even in a release build, you would still need to avoid any library functions that might allocate or release memory while holding a lock, which might be a problem if, for example, a hypothetical future implementation of std::vector was changed to make it thread-safe.
I recommend that you avoid the issue entirely, which is probably best done by creating a dedicated heap for your worker thread and taking all necessary memory allocations out of that heap. It would probably be best to exclude this heap from processing; the documentation for HeapWalk does not explicitly say that you should not modify the heap during enumeration, but it seems risky.
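A minimal sketch of that suggestion, assuming a private heap created with HeapCreate (the names g_hWorkerHeap, WorkerAlloc and WorkerFree are illustrative, not from the original code):
//Sketch only: a private heap for the worker thread's own bookkeeping.
//Allocations made from it never touch the heaps being locked and walked.
static HANDLE g_hWorkerHeap = ::HeapCreate(0, 0, 0);   //growable private heap

void* WorkerAlloc(size_t cb) { return ::HeapAlloc(g_hWorkerHeap, 0, cb); }
void  WorkerFree(void* p)    { ::HeapFree(g_hWorkerHeap, 0, p); }

//In WorkerThread(), skip the private heap when enumerating:
//  if(hHeaps[i] == g_hWorkerHeap) continue;
Note that std::vector<MEM_PAGE_TO_LOCK> would still draw from the CRT heap; moving it onto the private heap as well would require a custom allocator built on top of these functions, which is omitted here.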

Related

32-bit malloc() returns NULL when opening many threads?

I have a sample C++ program as below:
#include <windows.h>
#include <stdio.h>
int main(int argc, char* argv[])
{
    void * pointerArr[20000];
    int i = 0, j;

    for (i = 0; i < 20000; i++) {
        void * pointer = malloc(131125);
        if (pointer == NULL) {
            printf("i = %d, out of memory!\n", i);
            getchar();
            break;
        }
        pointerArr[i] = pointer;
    }

    for (j = 0; j < i; j++) {
        free(pointerArr[j]);
    }

    getchar();
    return 0;
}
When I run it as a Visual Studio 32-bit Debug build, I get the following result:
The program can use nearly 2 GB of memory before running out of memory.
This is normal behavior.
However, when I add code to start a thread inside the for loop, as below:
#include <windows.h>
#include <stdio.h>
DWORD WINAPI thread_func(VOID* pInArgs)
{
    Sleep(100000);
    return 0;
}

int main(int argc, char* argv[])
{
    void * pointerArr[20000];
    int i = 0, j;

    for (i = 0; i < 20000; i++) {
        CreateThread(NULL, 0, thread_func, NULL, 0, NULL);
        void * pointer = malloc(131125);
        if (pointer == NULL) {
            printf("i = %d, out of memory!\n", i);
            getchar();
            break;
        }
        pointerArr[i] = pointer;
    }

    for (j = 0; j < i; j++) {
        free(pointerArr[j]);
    }

    getchar();
    return 0;
}
The result is as below:
The memory usage is still only around 200 MB, but malloc returns NULL.
Could anyone help explain why the program cannot use up to 2 GB of memory before running out?
Does this mean that creating many threads like this causes a memory leak?
In my real application, this error occurs when I create about 800 threads; RAM usage at the time of the "out of memory" is around 300 MB.
As noted in a comment by #macroland, the main thing happening here is that each thread is consuming 1 MiB for its stack (see MSDN CreateThread and Thread Stack Size). You say malloc returns NULL once the total you have directly allocated reaches 200 MB. Since you are allocating 131125 bytes at a time, that is 200 MB / 131125 B = 1525 threads. Their cumulative stack space will be around 1.5 GB. Adding the 200 MB of malloc memory is 1.7 GB, and miscellaneous overhead likely accounts for the rest.
So, why does Task Manager not show this? Because the full 1 MiB of thread stack space is not actually allocated (also called committed), rather it is reserved. See VirtualAlloc and the MEM_RESERVE flag. The address space has been reserved for expansion up to 1 MiB, but initially only 64 KiB are allocated, and Task Manager only counts the latter. But reserved memory will not be unilaterally repurposed by malloc until the reservation is lifted, so once it runs out of available address space, it has to return NULL.
What tool can show this? I don't know of anything off the shelf (even Process Explorer does not seem to show a count of reserved memory). What I have done in the past is write my own little routine that uses VirtualQuery to enumerate the entire address space, including reserved ranges. I recommend you do the same; it's not much code to write, and very handy when coding for 32-bit Windows because the 2 GiB address space gets cramped very easily (DLLs are an obvious reason, but the default malloc will also leave unexpected reservations behind in response to certain allocation patterns even if you free everything).
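For what it's worth, such a routine can be quite short. The sketch below is my own illustration, not the exact code referred to above; error handling is omitted and %Iu is the MSVC printf format for SIZE_T:
#include <windows.h>
#include <stdio.h>

void DumpAddressSpace()
{
    SIZE_T committed = 0, reserved = 0, free_ = 0;
    MEMORY_BASIC_INFORMATION mbi;
    BYTE* p = NULL;

    // Walk the whole user address space region by region.
    while (VirtualQuery(p, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        if (mbi.State == MEM_COMMIT)       committed += mbi.RegionSize;
        else if (mbi.State == MEM_RESERVE) reserved  += mbi.RegionSize;
        else                               free_     += mbi.RegionSize;

        p = (BYTE*)mbi.BaseAddress + mbi.RegionSize;
        if (p < (BYTE*)mbi.BaseAddress) break;   // wrapped past the top: done
    }

    printf("committed %Iu KiB, reserved %Iu KiB, free %Iu KiB\n",
           committed / 1024, reserved / 1024, free_ / 1024);
}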
In any case, if you want to create thousands of threads in a 32-bit Windows process, be sure to pass a non-zero value as the dwStackSize parameter to CreateThread, and also pass STACK_SIZE_PARAM_IS_A_RESERVATION as dwCreationFlags. The minimum is 64 KiB, which will be plenty if you avoid recursive algorithms in the threads.
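For example (an illustrative call, reusing thread_func from the question):
// 64 KiB address-space reservation per thread instead of the default 1 MiB.
HANDLE h = CreateThread(NULL,
                        64 * 1024,                          // dwStackSize
                        thread_func, NULL,
                        STACK_SIZE_PARAM_IS_A_RESERVATION,  // dwStackSize is the reservation
                        NULL);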
Addendum: In a comment, #iinspectable cautions against using thousands of threads, citing Raymond Chen's 2005 blog post Does Windows have a limit of 2000 threads per process?. I agree that doing so is questionable for a variety of reasons; it is not my intent to endorse the practice, rather I'm just explaining one necessary element.

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.

void malloc_wrapper() {
    // ...
    pthread_mutex_lock(&alloc_mutex);
    if (boolmutex) {
        printf("mutex misbehaving\n");
        __THROW_ERROR__; // this happens!
    }
    boolmutex = true;

    // manipulate linked list here

    boolmutex = false;
    pthread_mutex_unlock(&alloc_mutex);
    // ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with less threads seems to make the corruptions less common, but not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out of bounds writes but glibc still occasionally fails with a corrupted heap error (Can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list to keep this post from getting any longer.
Thanks in advance for any help!
Thanks to all commenters. I tried nearly all the suggestions with no results, until I finally decided to write a simple memory allocation stress test: one that runs a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad-core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes, or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>
#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS
// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
int* alive_flag = (int*)arg;
int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
int cnt = 0;
timeval t_pre, t_post;
gettimeofday(&t_pre, NULL);
const int ALLOCATE=1, FREE=0;
const unsigned int MINSIZE=500, MAXSIZE=1000;
const int MAX_ALLOC=10000;
char* membufs[MAXSIZE];
unsigned int membufs_size = 0;
int num_allocs = 0, num_frees = 0;
while(1)
{
int action;
// Decide whether to allocate or free a memory block.
// if we have less than MINSIZE buffers, allocate.
if (membufs_size < MINSIZE) action = ALLOCATE;
// if we have MAXSIZE, free.
else if (membufs_size >= MAXSIZE) action = FREE;
// else, decide randomly.
else {
action = ((rand() & 0x1)? ALLOCATE : FREE);
}
if (action == ALLOCATE) {
// choose size to allocate, from 1 to MAX_ALLOC bytes
size_t size = (rand() % MAX_ALLOC) + 1;
// allocate and fill memory
char* buf = (char*)malloc(size);
memset(buf, 0x77, size);
// add buffer to list
membufs[membufs_size] = buf;
membufs_size++;
assert(membufs_size <= MAXSIZE);
num_allocs++;
}
else { // action == FREE
// choose a random buffer to free
size_t pos = rand() % membufs_size;
assert (pos < membufs_size);
// free and remove from list by replacing entry with last member
free(membufs[pos]);
membufs[pos] = membufs[membufs_size-1];
membufs_size--;
assert(membufs_size >= 0);
num_frees++;
}
// once in 10 seconds print a status update
gettimeofday(&t_post, NULL);
if (t_post.tv_sec - t_pre.tv_sec >= 10) {
printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
gettimeofday(&t_pre, NULL);
}
// indicate alive to main thread
*alive_flag = ALIVE_INDICATOR;
}
return NULL;
}
int main()
{
int alive_flag[NUM_THREADS];
printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
// start a thread for each core
for (int i=0; i<NUM_THREADS; i++) {
alive_flag[i] = i; // tell each thread its ID.
pthread_t th;
int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
assert(ret == 0);
}
while(1) {
sleep(10);
// check that all threads are alive
bool ok = true;
for (int i=0; i<NUM_THREADS; i++) {
if (alive_flag[i] != ALIVE_INDICATOR)
{
printf("Thread %d is not responding\n", i);
ok = false;
}
}
assert(ok);
for (int i=0; i<NUM_THREADS; i++)
alive_flag[i] = 0;
}
return 0;
}

HeapWalk not working as expected in Release mode

So I used this example of the HeapWalk function to implement it into my app. I played around with it a bit and saw that when I added
HANDLE d = HeapAlloc(hHeap, 0, sizeof(int));
int* f = new(d) int;
after creating the heap then some new output would be logged:
Allocated block Data portion begins at: 0X037307E0
Size: 4 bytes
Overhead: 28 bytes
Region index: 0
So seeing this I thought I could check Entry.wFlags to see if it was set as PROCESS_HEAP_ENTRY_BUSY to keep a track of how much allocated memory I'm using on the heap. So I have:
HeapLock(heap);

int totalUsedSpace = 0, totalSize = 0, largestFreeSpace = 0, largestCounter = 0;
PROCESS_HEAP_ENTRY entry;
entry.lpData = NULL;

while (HeapWalk(heap, &entry) != FALSE)
{
    int entrySize = entry.cbData + entry.cbOverhead;
    if ((entry.wFlags & PROCESS_HEAP_ENTRY_BUSY) != 0)
    {
        // We have allocated memory in this block
        totalUsedSpace += entrySize;
        largestCounter = 0;
    }
    else
    {
        // We do not have allocated memory in this block
        largestCounter += entrySize;
        if (largestCounter > largestFreeSpace)
        {
            // Save this value as we've found a bigger space
            largestFreeSpace = largestCounter;
        }
    }

    // Keep a track of the total size of this heap
    totalSize += entrySize;
}

HeapUnlock(heap);
And this appears to work when built in debug mode (totalSize and totalUsedSpace are different values). However, when I run it in Release mode totalUsedSpace is always 0.
I stepped through it with the debugger while in Release mode and for each heap it loops three times and I get the following flags in entry.wFlags from calling HeapWalk:
1 (PROCESS_HEAP_REGION)
0
2 (PROCESS_HEAP_UNCOMMITTED_RANGE)
It then exits the while loop and GetLastError() returns ERROR_NO_MORE_ITEMS as expected.
From here I found that a flag value of 0 is "the committed block which is free, i.e. not being allocated or not being used as control structure."
Does anyone know why it does not work as intended when built in Release mode? I don't have much experience of how memory is handled by the computer, so I'm not sure where the error might be coming from. Searching on Google didn't come up with anything so hopefully someone here knows.
UPDATE: I'm still looking into this myself. If I monitor the app using vmmap I can see that the process has 9 heaps, but calling GetProcessHeaps reports 22 heaps. Also, none of the heap handles it returns matches the return value of GetProcessHeap() or _get_heap_handle(). It seems like GetProcessHeaps is not behaving as expected. Here is the code to get the list of heaps:
// Count how many heaps there are and allocate enough space for them
DWORD numHeaps = GetProcessHeaps(0, NULL);
HANDLE* handles = new HANDLE[numHeaps];
// Get a handle to known heaps for us to compare against
HANDLE defaultHeap = GetProcessHeap();
HANDLE crtHeap = (HANDLE)_get_heap_handle();
// Get a list of handles to all the heaps
DWORD retVal = GetProcessHeaps(numHeaps, handles);
And retVal is the same value as numHeaps, which indicates that there was no error.
Application Verifier had been set up previously to do full page heap verification of my executable and was interfering with the heaps returned by GetProcessHeaps. I'd forgotten it was set up, as it had been done for a different issue several days ago and then closed without clearing the tests. It wasn't happening in the debug build because the application builds to a different file name for debug builds.
We managed to detect this by adding a breakpoint and looking at the callstack of the thread. We could see the AV DLL had been injected in and that let us know where to look.
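As a cheap sanity check for anyone hitting the same surprise, you can also test at runtime whether the verifier engine has been injected. This is just an illustration and assumes the engine DLL is named verifier.dll:
// If Application Verifier's page heap is active, its engine DLL is loaded into
// the process, which also explains the extra heaps reported by GetProcessHeaps.
if (GetModuleHandleW(L"verifier.dll") != NULL)
    OutputDebugStringW(L"Application Verifier is injected into this process\n");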

Windows7 memory management - how to prevent concurrent threads from blocking

I'm working on a program consisting of two concurrent threads. One (here "Clock") is performing some computation on a regular basis (10 Hz) and is quite memory-intensive. The other one (here "hugeList") uses even more RAM but is not as time critical as the first one. So I decided to reduce its priority to THREAD_PRIORITY_LOWEST. Yet, when the thread frees most of the memory it has used the critical one doesn't manage to keep its timing.
I was able to condense down the problem to this bit of code (make sure optimizations are turned off!):
while Clock tries to keep a 10Hz-timing the hugeList-thread allocates and frees more and more memory not organized in any sort of chunks.
#include "stdafx.h"
#include <stdio.h>
#include <forward_list>
#include <time.h>
#include <windows.h>
#include <vector>
void wait_ms(double _ms)
{
clock_t endwait;
endwait = clock () + _ms * CLOCKS_PER_SEC/1000;
while (clock () < endwait) {} // active wait
}
void hugeList(void)
{
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_LOWEST);
unsigned int loglimit = 3;
unsigned int limit = 1000;
while(true)
{
for(signed int cnt=loglimit; cnt>0; cnt--)
{
printf(" Countdown %d...\n", cnt);
wait_ms(1000.0);
}
printf(" Filling list...\n");
std::forward_list<double> list;
for(unsigned int cnt=0; cnt<limit; cnt++)
list.push_front(42.0);
loglimit++;
limit *= 10;
printf(" Clearing list...\n");
while(!list.empty())
list.pop_front();
}
}
void Clock()
{
clock_t start = clock()-CLOCKS_PER_SEC*100/1000;
while(true)
{
std::vector<double> dummyData(100000, 42.0); // just get some memory
printf("delta: %d ms\n", (clock()-start)*1000/CLOCKS_PER_SEC);
start = clock();
wait_ms(100.0);
}
}
int main()
{
DWORD dwThreadId;
if (CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)&Clock, (LPVOID) NULL, 0, &dwThreadId) == NULL)
printf("Thread could not be created");
if (CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)&hugeList, (LPVOID) NULL, 0, &dwThreadId) == NULL)
printf("Thread could not be created");
while(true) {;}
return 0;
}
First of all I noticed that allocating memory for the linked list is way faster than freeing it.
On my machine (Windows7) at around the 4th iteration of the "hugeList"-method the Clock-Thread gets significantly disturbed (up to 200ms). The effect disappears without the dummyData-vector "asking" for some memory in the Clock-Thread.
So,
Is there any way of increasing the priority of memory allocation for the Clock-Thread in Win7?
Or do I have to split both operations onto two contexts (processes)?
Note that my original code uses some communication via shared variables which would require for some kind of IPC if I chose the second option.
Note that my original code gets stuck for about 1 sec when the equivalent of the "hugeList"-method clears a boost::unordered_map and enters ntdll.dll!RtlInitializeCriticalSection many, many times (observed with Sysinternals Process Explorer).
Note that the effects observed are not due to swapping; I'm using 1.4 GB of my 16 GB (64-bit Win7).
edit:
just wanted to let you know that up to now I haven't been able to solve my issue. Splitting both parts of the code onto two processes does not seem to be an option since my time is rather limited and I've never worked with processes so far. I'm afraid I won't be able to get to a running version in time.
However, I managed to reduce the effects by reducing the number of memory deallocations made by the non-critical thread. This was achieved by using a fast pooling memory allocator (like the one provided in the boost library).
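One possible shape of that change, assuming Boost.Pool is available (a sketch, not the exact code I used):
#include <forward_list>
#include <boost/pool/pool_alloc.hpp>

// Node allocations come from a pool, so clearing the list hands the nodes back
// to the pool instead of issuing one heap free per element.
typedef std::forward_list<double, boost::fast_pool_allocator<double> > PooledList;

void fillAndClear(unsigned int limit)
{
    PooledList list;
    for(unsigned int cnt = 0; cnt < limit; cnt++)
        list.push_front(42.0);
    list.clear();   // cheap: no per-node trips to the process heap
}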
There does not seem to be a way of explicitly creating certain objects (like, e.g., the huge forward list in my example) on some sort of thread-private heap that would not require synchronisation.
For further reading:
http://bmagic.sourceforge.net/memalloc.html
Do threads have a distinct heap?
Memory Allocation/Deallocation Bottleneck?
http://software.intel.com/en-us/articles/avoiding-heap-contention-among-threads
http://www.boost.org/doc/libs/1_55_0/libs/pool/doc/html/boost_pool/pool/introduction.html
Replacing std::forward_list with std::list, I ran your code on a Core i7 machine with 4 GB of RAM until 2 GB was consumed. No disturbances at all (in a debug build).
P.S.
Yes, the release build recreates the issue. I replaced the forward list with an array
double* p = new double[limit];
for(unsigned int cnt=0; cnt<limit; cnt++)
p[cnt] = 42.0;
and
for(unsigned int cnt=0; cnt<limit; cnt++)
p[cnt] = -1;
delete [] p;
It does not recreate the issue then.
It seems the thread scheduler is penalizing the thread for asking for a lot of small memory chunks.

How to find a memory leak in C++

What would be a good way to detect a C++ memory leak in an embedded environment? I tried overloading the new operator to log every data allocation, but I must have done something wrong, that approach isn't working. Has anyone else run into a similar situation?
This is the code for the new and delete operator overloading.
EDIT:
Full disclosure: I am looking for a memory leak in my program and I am using this code, which someone else wrote, to overload the new and delete operators. Part of my problem is the fact that I don't fully understand what it does. I know that the goal is to log the address of the caller and previous caller, the size of the allocation, a 1 if we are allocating, a 2 if we are deallocating, plus the name of the thread that is running.
Thanks for all the suggestions, I am going to try a different approach that someone here at work suggested. If it works, I will post it here.
Thanks again to all you first-rate programmers for taking the time to answer.
StackOverflow rocks!
Conclusion
Thanks for all the answers. Unfortunately, I had to move on to a different more pressing issue. This leak only occurred under a highly unlikely scenario. I feel crappy about just dropping it, I may go back to it if I have more time. I chose the answer I am most likely to use.
#include <stdlib.h>
#include "stdio.h"
#include "nucleus.h"
#include "plus/inc/dm_defs.h"
#include "plus/inc/pm_defs.h"
#include "posix\inc\posix.h"
extern void* TCD_Current_Thread;
extern "C" void rd_write_text(char * text);
extern PM_PCB * PMD_Created_Pools_List;
typedef struct {
void* addr;
uint16_t size;
uint16_t flags;
} MemLogEntryNarrow_t;
typedef struct {
void* addr;
uint16_t size;
uint16_t flags;
void* caller;
void* prev_caller;
void* taskid;
uint32_t timestamp;
} MemLogEntryWide_t;
//size lookup table
unsigned char MEM_bitLookupTable[] = {
0,1,1,2,1,2,2,3,1,2,2,3,1,3,3,4
};
//#pragma CODE_SECTION ("section_ramset1_0")
void *::operator new(unsigned int size)
{
asm(" STR R14, [R13, #0xC]"); //save stack address temp[0]
asm(" STR R13, [R13, #0x10]"); //save pc return address temp[1]
if ( loggingEnabled )
{
uint32_t savedInterruptState;
uint32_t currentIndex;
// protect the thread unsafe section.
savedInterruptState = NU_Local_Control_Interrupts(NU_DISABLE_INTERRUPTS);
// Note that this code is FRAGILE. It peeks backwards on the stack to find the return
// address of the caller. The location of the return address on the stack can be easily changed
// as a result of other changes in this function (i.e. adding local variables, etc).
// The offsets may need to be adjusted if this function is touched.
volatile unsigned int temp[2];
unsigned int *addr = (unsigned int *)temp[0] - 1;
unsigned int count = 1 + (0x20/4); //current stack space ***
//Scan for previous store
while ((*addr & 0xFFFF0000) != 0xE92D0000)
{
if ((*addr & 0xFFFFF000) == 0xE24DD000)
{
//add offset in words
count += ((*addr & 0xFFF) >> 2);
}
addr--;
}
count += MEM_bitLookupTable[*addr & 0xF];
count += MEM_bitLookupTable[(*addr >>4) & 0xF];
count += MEM_bitLookupTable[(*addr >> 8) & 0xF];
count += MEM_bitLookupTable[(*addr >> 12) & 0xF];
addr = (unsigned int *)temp[1] + count;
// FRAGILE CODE ENDS HERE
currentIndex = currentMemLogWriteIndex;
currentMemLogWriteIndex++;
if ( memLogNarrow )
{
if (currentMemLogWriteIndex >= MEMLOG_SIZE/2 )
{
loggingEnabled = false;
rd_write_text( "Allocation Logging is complete and DISABLED!\r\n\r\n");
}
// advance the read index if necessary.
if ( currentMemLogReadIndex == currentMemLogWriteIndex )
{
currentMemLogReadIndex++;
if ( currentMemLogReadIndex == MEMLOG_SIZE/2 )
{
currentMemLogReadIndex = 0;
}
}
NU_Local_Control_Interrupts(savedInterruptState);
//Standard operator
//(For Partition Analysis we have to consider that if we alloc size of 0 always as size of 1 then are partitions must be optimized for this)
if (size == 0) size = 1;
((MemLogEntryNarrow_t*)memLog)[currentIndex].size = size;
((MemLogEntryNarrow_t*)memLog)[currentIndex].flags = 1; //allocated
//Standard operator
void * ptr;
ptr = malloc(size);
((MemLogEntryNarrow_t*)memLog)[currentIndex].addr = ptr;
return ptr;
}
else
{
if (currentMemLogWriteIndex >= MEMLOG_SIZE/6 )
{
loggingEnabled = false;
rd_write_text( "Allocation Logging is complete and DISABLED!\r\n\r\n");
}
// advance the read index if necessary.
if ( currentMemLogReadIndex == currentMemLogWriteIndex )
{
currentMemLogReadIndex++;
if ( currentMemLogReadIndex == MEMLOG_SIZE/6 )
{
currentMemLogReadIndex = 0;
}
}
((MemLogEntryWide_t*)memLog)[currentIndex].caller = (void *)(temp[0] - 4);
((MemLogEntryWide_t*)memLog)[currentIndex].prev_caller = (void *)*addr;
NU_Local_Control_Interrupts(savedInterruptState);
((MemLogEntryWide_t*)memLog)[currentIndex].taskid = (void *)TCD_Current_Thread;
((MemLogEntryWide_t*)memLog)[currentIndex].size = size;
((MemLogEntryWide_t*)memLog)[currentIndex].flags = 1; //allocated
((MemLogEntryWide_t*)memLog)[currentIndex].timestamp = *(volatile uint32_t *)0xfffbc410; // for arm9
//Standard operator
if (size == 0) size = 1;
void * ptr;
ptr = malloc(size);
((MemLogEntryWide_t*)memLog)[currentIndex].addr = ptr;
return ptr;
}
}
else
{
//Standard operator
if (size == 0) size = 1;
void * ptr;
ptr = malloc(size);
return ptr;
}
}
//#pragma CODE_SECTION ("section_ramset1_0")
void ::operator delete(void *ptr)
{
uint32_t savedInterruptState;
uint32_t currentIndex;
asm(" STR R14, [R13, #0xC]"); //save stack address temp[0]
asm(" STR R13, [R13, #0x10]"); //save pc return address temp[1]
if ( loggingEnabled )
{
savedInterruptState = NU_Local_Control_Interrupts(NU_DISABLE_INTERRUPTS);
// Note that this code is FRAGILE. It peeks backwards on the stack to find the return
// address of the caller. The location of the return address on the stack can be easily changed
// as a result of other changes in this function (i.e. adding local variables, etc).
// The offsets may need to be adjusted if this function is touched.
volatile unsigned int temp[2];
unsigned int *addr = (unsigned int *)temp[0] - 1;
unsigned int count = 1 + (0x20/4); //current stack space ***
//Scan for previous store
while ((*addr & 0xFFFF0000) != 0xE92D0000)
{
if ((*addr & 0xFFFFF000) == 0xE24DD000)
{
//add offset in words
count += ((*addr & 0xFFF) >> 2);
}
addr--;
}
count += MEM_bitLookupTable[*addr & 0xF];
count += MEM_bitLookupTable[(*addr >>4) & 0xF];
count += MEM_bitLookupTable[(*addr >> 8) & 0xF];
count += MEM_bitLookupTable[(*addr >> 12) & 0xF];
addr = (unsigned int *)temp[1] + count;
// FRAGILE CODE ENDS HERE
currentIndex = currentMemLogWriteIndex;
currentMemLogWriteIndex++;
if ( memLogNarrow )
{
if ( currentMemLogWriteIndex >= MEMLOG_SIZE/2 )
{
loggingEnabled = false;
rd_write_text( "Allocation Logging is complete and DISABLED!\r\n\r\n");
}
// advance the read index if necessary.
if ( currentMemLogReadIndex == currentMemLogWriteIndex )
{
currentMemLogReadIndex++;
if ( currentMemLogReadIndex == MEMLOG_SIZE/2 )
{
currentMemLogReadIndex = 0;
}
}
NU_Local_Control_Interrupts(savedInterruptState);
// finish logging the fields. these are thread safe so they dont need to be inside the protected section.
((MemLogEntryNarrow_t*)memLog)[currentIndex].addr = ptr;
((MemLogEntryNarrow_t*)memLog)[currentIndex].size = 0;
((MemLogEntryNarrow_t*)memLog)[currentIndex].flags = 2; //unallocated
}
else
{
((MemLogEntryWide_t*)memLog)[currentIndex].caller = (void *)(temp[0] - 4);
((MemLogEntryWide_t*)memLog)[currentIndex].prev_caller = (void *)*addr;
if ( currentMemLogWriteIndex >= MEMLOG_SIZE/6 )
{
loggingEnabled = false;
rd_write_text( "Allocation Logging is complete and DISABLED!\r\n\r\n");
}
// advance the read index if necessary.
if ( currentMemLogReadIndex == currentMemLogWriteIndex )
{
currentMemLogReadIndex++;
if ( currentMemLogReadIndex == MEMLOG_SIZE/6 )
{
currentMemLogReadIndex = 0;
}
}
NU_Local_Control_Interrupts(savedInterruptState);
// finish logging the fields. these are thread safe so they dont need to be inside the protected section.
((MemLogEntryWide_t*)memLog)[currentIndex].addr = ptr;
((MemLogEntryWide_t*)memLog)[currentIndex].size = 0;
((MemLogEntryWide_t*)memLog)[currentIndex].flags = 2; //unallocated
((MemLogEntryWide_t*)memLog)[currentIndex].taskid = (void *)TCD_Current_Thread;
((MemLogEntryWide_t*)memLog)[currentIndex].timestamp = *(volatile uint32_t *)0xfffbc410; // for arm9
}
//Standard operator
if (ptr != NULL) {
free(ptr);
}
}
else
{
//Standard operator
if (ptr != NULL) {
free(ptr);
}
}
}
If you're running Linux, I suggest trying Valgrind.
There are several forms of operator new:
void *operator new (size_t);
void *operator new [] (size_t);
void *operator new (size_t, void *);
void *operator new [] (size_t, void *);
void *operator new (size_t, /* parameters of your choosing! */);
void *operator new [] (size_t, /* parameters of your choosing! */);
All the above can exist at both global and class scope. For each operator new, there is an equivalent operator delete. You need to make sure you are adding logging to all versions of the operator, if that is the way you want to do it.
Ideally, you would want the system to behave the same regardless of whether the memory logging is present or not. For example, the MS VC run-time library allocates more memory in debug than in release because it prefixes each allocation with a bigger information block and adds guard blocks to the start and end of the allocation. The best solution is to keep all the memory logging information in a separate chunk of memory and use a map to track the allocations; this can also be used to verify that the memory passed to delete is valid (a small sketch of this follows the steps below).
new
allocate memory
add entry to logging table
delete
check address exists in logging table
free memory
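A minimal sketch of that "separate table" idea for the global, non-placement forms, assuming C++11 and leaving out thread safety and the array forms; the names are illustrative:
#include <cstdlib>
#include <map>
#include <new>

namespace {
    // The map itself allocates too; a real implementation would give it an
    // untracked allocator. Here a guard flag just prevents recursive logging.
    std::map<void*, std::size_t>& allocTable() {
        static std::map<void*, std::size_t> table;
        return table;
    }
    bool g_inLogger = false;
}

void* operator new(std::size_t size) {
    void* p = std::malloc(size ? size : 1);
    if (!p) throw std::bad_alloc();
    if (!g_inLogger) {
        g_inLogger = true;
        allocTable()[p] = size;     // add entry to logging table
        g_inLogger = false;
    }
    return p;
}

void operator delete(void* p) noexcept {
    if (p && !g_inLogger) {
        g_inLogger = true;
        allocTable().erase(p);      // a missing entry here would indicate a bad free
        g_inLogger = false;
    }
    std::free(p);
}
Whatever is left in the table at shutdown is, by definition, the set of leaked blocks.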
However, you're writing embedded software where, usually, memory is a limited resource. It is usually preferable on these systems to avoid dynamic memory allocation for several reasons:
You know how much memory there is so you know in advance how many objects you can allocate. Allocation should never return null as that is usually terminal with no easy way of getting back to a healthy system.
Allocating and freeing memory leads to fragmentation. The number of objects you can allocate will decrease over time. You could write a memory compactor to move allocated objects around to free up bigger chunks of memory but that will affect performance. As in point 1, once you get a null, things get tricky.
So, when doing embedded work, you usually know up front how much memory can be allocated to various objects and, knowing this, you can write more efficient memory managers for each object type that can take appropriate action when memory runs out - discarding old items, crashing, etc.
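To make that concrete, here is a minimal illustration (my own, not from the answer) of a per-type fixed pool in which running out is an explicit, local condition rather than a surprise NULL from the global heap; alignment handling is deliberately simplified:
#include <new>

template <typename T, unsigned N>
class FixedPool {
public:
    FixedPool() : freeHead_(0) {
        for (unsigned i = 0; i + 1 < N; ++i) slots_[i].next = i + 1;
        slots_[N - 1].next = NPOS;
    }
    T* alloc() {
        if (freeHead_ == NPOS) return 0;      // pool exhausted: caller decides what to do
        Slot& s = slots_[freeHead_];
        freeHead_ = s.next;
        return new (s.raw) T();               // construct in place
    }
    void release(T* p) {
        p->~T();
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = freeHead_;
        freeHead_ = static_cast<unsigned>(s - slots_);
    }
private:
    static const unsigned NPOS = 0xFFFFFFFFu;
    union Slot { unsigned next; char raw[sizeof(T)]; };
    Slot slots_[N];
    unsigned freeHead_;
};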
EDIT
If you want to know what called the memory allocation, the best thing to do is use a macro (I know, macros are generally bad):
#define NEW new (__FILE__, __LINE__, __FUNCTION__)
and define an operator new:
void *operator new (size_t size, char *file, int line, char *function)
{
// log the allocation somewhere, no need to strcpy file or function, just save the
// pointer values
return malloc (size);
}
and use it like this:
SomeObject *obj = NEW SomeObject (parameters);
Your compiler might not have the __FUNCTION__ preprocessor definition, so you can safely omit it.
http://www.linuxjournal.com/article/6059
Actually, in my experience it is always better to create memory pools for embedded systems and use a custom allocator/deallocator; that makes leaks easy to identify. For example, we had a simple custom memory manager for VxWorks where we stored the task ID and a timestamp in each allocated memory block.
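The general shape of such a manager might look like the sketch below. It uses portable C++11 stand-ins (std::this_thread, steady_clock) where the VxWorks version would record the task ID and tick count, and the names are illustrative:
#include <chrono>
#include <cstdlib>
#include <functional>
#include <thread>

// Over-allocate a small header and stamp it with the owner and a timestamp.
// A leak report then just walks the live headers and prints these fields.
struct AllocHeader {
    std::size_t size;
    std::size_t owner;      // hashed thread id (task ID on VxWorks)
    long long   timeStamp;  // steady_clock ticks (tick count on VxWorks)
};

void* tracked_alloc(std::size_t size) {
    AllocHeader* h = static_cast<AllocHeader*>(std::malloc(sizeof(AllocHeader) + size));
    if (!h) return 0;
    h->size      = size;
    h->owner     = std::hash<std::thread::id>()(std::this_thread::get_id());
    h->timeStamp = std::chrono::steady_clock::now().time_since_epoch().count();
    return h + 1;           // user data starts right after the header
}

void tracked_free(void* p) {
    if (p) std::free(static_cast<AllocHeader*>(p) - 1);
}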
One way is to insert the file name and line number (via pointers to string literals) of the module allocating memory into the allocated block of data. The file and line number are provided by the standard "__FILE__" and "__LINE__" macros. When the memory is de-allocated, that information is removed.
One of our systems has this feature and we call it a "memory hog report". So anytime from our CLI we can print out all the allocated memory along with a big list of information of who has allocated memory. This list is sorted by which code module has the most memory allocated. Many times we'll monitor memory usage this way over time, and eventually the memory hog (leak) will bubble up to the top of the list.
Overloading new and delete should work if you pay close attention.
Maybe you can show us what isn't working about that approach?
I'm not an embedded environment expert, so the only advice I can give is to test as much code as you can on your development machine using your favorite free or proprietary tools. Tools for a particular embedded platform may also exist and you can use them for final testing, but the most powerful tools are for desktops.
In a desktop environment I like the job DevPartner Studio does; it is for Windows and proprietary. There are free tools available for Linux, but I don't have much experience with them. As an example, there's EFence.
If you add logging to the constructors and destructors of your classes, you can print to the screen or a log file. Doing this will give you an idea of when and what is being created, as well as the same information for deletion.
For easy browsing, you can add a temporary global variable, "INSTANCE_ID", and print (and increment) it on every constructor/destructor call. Then you can browse by ID, which should make things a little easier.
The way we did it with our C 3D toolkit was to create custom new/malloc and delete macros that logged each allocation and deallocation to a file. We had to ensure that all the code called our macros of course. The writing to the log file was controlled by a run time flag and only happened under debug so we didn't have to recompile.
Once the run was complete a post-processor ran over the file matching allocations to deallocations and reported any unmatched allocations.
It had a performance hit, but we only needed to do it once in a while.
Is there a real need to roll your own memory leak detection?
Assuming you can't use dynamic memory checkers, like the open-source valgrind tool on Linux, static analysis tools like the commercial products Coverity Prevent and Klocwork Insight may be of use. I've used all three, and have had very good results with all of them.
Lots of good answers.
I would just point out that if the program is one that, like a small command-line utility, runs for a short period of time and then releases all its memory back to the OS, memory leaks probably do no harm.
You can use a third party tool to do this.
You can detect leaks within your own class structures by adding memory counters in your New and Delete calls to increment/decrement the memory counters, and print out a report at your application close. However, this won't detect memory leaks for memory allocated outside your class system - a third party tool can do this though.
Can you describe what is "not working" with your log methods?
Do you not get the expected logs? Or are they showing that everything is fine while you still have leaks?
How have you confirmed that this is definitely a leak and not some other type of corruption?
One way to check your overloading is correct: instantiate a counter object per class, incrementing it in the class's new and decrementing it in its delete. If you have a growing count, you have a leak. You would then expect your log lines to coincide with the increment and decrement points.
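A bare-bones version of that counter idea (illustrative; make the counter std::atomic<int> if the class is used from multiple threads):
class Widget {
public:
    Widget()  { ++s_liveCount; }
    ~Widget() { --s_liveCount; }
    static int liveCount() { return s_liveCount; }
private:
    static int s_liveCount;
};
int Widget::s_liveCount = 0;

// At shutdown, a non-zero count means Widget instances were leaked:
//   printf("leaked Widgets: %d\n", Widget::liveCount());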
Not specifically for embedded development, but we used to use BoundsChecker for that.
Use smart pointers and never think about it again; there are plenty of standard types around, but they're pretty easy to roll your own too:
class SmartCoMem
{
public:
SmartCoMem() : m_size( 0 ), m_ptr64( 0 ) {
}
~SmartCoMem() {
if( m_size )
CoTaskMemFree((LPVOID)m_ptr64);
}
void copyin( LPCTSTR in, const unsigned short size )
{
LPVOID ptr;
ptr = CoTaskMemRealloc( (LPVOID)m_ptr64, size );
if( ptr == NULL )
throw std::exception( "SmartCoMem: CoTaskMemAlloc Failed" );
else
{
m_size = size;
m_ptr64 = (__int64)ptr;
memcpy( (LPVOID)m_ptr64, in, size );
}
}
std::string copyout( ) {
std::string out( (LPCSTR)m_ptr64, m_size );
return out;
}
__int64* ptr() {
return &m_ptr64;
}
unsigned short size() {
return m_size;
}
unsigned short* sizePtr() {
return &m_size;
}
bool loaded() {
return m_size > 0;
}
private:
//don't allow copying as this is a wrapper around raw memory
SmartCoMem (const SmartCoMem &);
SmartCoMem & operator = (const SmartCoMem &);
__int64 m_ptr64;
unsigned short m_size;
};
There's no encapsulation in this example due to the API I was working with, but it's still better than working with completely raw pointers.
For testing like this, try compiling your embedded code natively for Linux (or whatever OS you use), and use a well-established tool like Valgrind to test for memory leaks. It can be a challenge to do this, but you just need to replace any code that directly accesses hardware with code that simulates something suitable for your testing.
I've found that using SWIG to convert my embedded code into a linux-native library and running the code from a Python script is really effective. You can use all of the tools that are available for non-embedded projects, and test all of your code except the hardware drivers.