Communication protocol and local loopback using setjmp / longjmp

Communication protocol and local loopback using setjmp / longjmp - c++

I've coded some relatively simple communication protocol using shared memory, and shared mutexes. But then I wanted to expand support to communicate between two .dll's having different run-time in use. It's quite obvious that if you have some std::vector<__int64> and two dll's - one vs2010, one vs2015 - they won't work politely with each other. Then I've thought - why I cannot serialize in ipc manner structure on one side and de-serialize it on another - then vs run-times will work smoothly with each other.
Long story short - I've created separate interface for sending next chunk of data and for requesting next chunk of data. Both are working while decoding happens - meaning if you have vector with 10 entries, each string 1 Mb, and shared memory is 10 Kb - then it would require 1*10*1024/10 times to transfer whole data. Each next request is followed by multiple outstanding function calls - either by SendChunk or GetNextChunk depending on transfer direction.
Now - I wanted to encode and decode happen simultaneously but without any threading - then I've came up with solution of using setjmp and longjmp. I'm attaching part of code below, just for you to get some understanding of what is happening in whole machinery.
#include "..."
#include <setjmp.h> //setjmp
class Jumper: public IMessageSerializer
{
public:
char lbuf[ sizeof(IpcCommand) + 10 ];
jmp_buf jbuf1;
jmp_buf jbuf2;
bool bChunkSupplied;
Jumper() :
bChunkSupplied(false)
{
memset( lbuf, 0 , sizeof(lbuf) );
}
virtual bool GetNextChunk( bool bSend, int offset )
{
if( !bChunkSupplied )
{
bChunkSupplied = true;
return true;
}
int r = setjmp(jbuf1);
((_JUMP_BUFFER *)&jbuf1)->Frame = 0;
if( r == 0 )
longjmp(jbuf2, 1);
bChunkSupplied = true;
return true;
}
virtual bool SendChunk( bool bLast )
{
bChunkSupplied = false;
int r = setjmp(jbuf2);
((_JUMP_BUFFER *)&jbuf2)->Frame = 0;
if( r == 0 )
longjmp(jbuf1, 1);
return true;
}
bool FlushReply( bool bLast )
{
return true;
}
IpcCommand* getCmd( void )
{
return (IpcCommand*) lbuf;
}
int bufSize( void )
{
return 10;
}
}; //class Jumper
Jumper jumper;
void main(void)
{
EncDecCtx enc(&jumper, true, true);
EncDecCtx dec(&jumper, false, false);
CString s;
if( setjmp(jumper.jbuf1) == 0 )
{
alloca(16*1024);
enc.encodeString(L"My testing my very very long string.");
enc.FlushBuffer(true);
} else {
dec.decodeString(s);
}
wprintf(L"%s\r\n", s.GetBuffer() );
}
There are couple of issues here. After first call to setjmp I'm using alloca() - which allocates memory from stack, it will be autofreed on return. alloca can happen only during first jump, because any function call always uses callstack (to save return address) and it can corrupt second "thread" call stack.
There are multiple articles discussing about how dangerous setjmp and longjmp are, but this is now somehow working solution. The stack size (16 Kb) is reservation for next function calls to come - decodeString and so on - it can be adjusted to bigger if not enough.
After trying out this code I've noticed that x86 code was working fine, but 64-but did not work - I've got similar problem to what is described here:
An invalid or unaligned stack was encountered during an unwind operation
Like article suggested I've added ((_JUMP_BUFFER *)&jbuf1)->Frame = 0; kind of resetting - and after that 64-bit code started to work. Currently library is not using any exception mechanism and I'm not planning to use any (will try-catch everything if needed in encode* decode* function calls.
So questions:
Is it acceptable solution to disable unwinding in code ? (((_JUMP_BUFFER *)&jbuf1)->Frame = 0;) What unwinding really means in context setjmp / longjmp ?
Do you see any potential problem with given code snipet?

Related

C++ lockless queue crashes with multiple threads

I'm trying to gain better understanding of controlling memory order when coding for multiple threads. I've used mutexes a lot in the past to serialize variable access, but I'm trying to avoid those where possible to improve performance.
I have a queue of pointers that may be filled by many threads and consumed by many threads. It works fine with a single thread, but crashes when I run with multiple threads. It looks like the consumers may be getting duplicates of the pointers which causes them to be freed twice. It's a little hard to tell since when I put in any print statements, it runs fine without crashing.
To start with I'm using a pre-allocated vector to hold the pointers. I keep 3 atomic index variables to keep track of what elements in the vector need processing. It may be worth noting that I tried using a _queue type where the elements themselves were atomic by that did not seem to help. Here is the simpler version:
std::atomic<uint32_t> iread;
std::atomic<uint32_t> iwrite;
std::atomic<uint32_t> iend;
std::vector<JEvent*> _queue;
// Write to _queue (may be thread 1,2,3,...)
while(!_done){
uint32_t idx = iwrite.load();
uint32_t inext = (idx+1)%_queue.size();
if( inext == iread.load() ) return kQUEUE_FULL;
if( iwrite.compare_exchange_weak(idx, inext) ){
_queue[idx] = jevent; // jevent is JEvent* passed into this method
while( !_done ){
if( iend.compare_exchange_weak(idx, inext) ) break;
}
break;
}
}
and from the same class
// Read from _queue (may be thread 1,2,3,...)
while(!_done){
uint32_t idx = iread.load();
if(idx == iend.load()) return NULL;
JEvent *Event = _queue[idx];
uint32_t inext = (idx+1)%_queue.size();
if( iread.compare_exchange_weak(idx, inext) ){
_nevents_processed++;
return Event;
}
}
I should emphasize that I am really interested in understanding why this doesn't work. Implementing some other pre-made package would get me past this problem, but would not help me avoid making the same type of mistakes again later.
UPDATE
I'm marking Alexandr Konovalov's answer as correct (see my comment in his answer below). In case anyone comes across this page, the corrected code for the "Write" section is:
std::atomic<uint32_t> iread;
std::atomic<uint32_t> iwrite;
std::atomic<uint32_t> iend;
std::vector<JEvent*> _queue;
// Write to _queue (may be thread 1,2,3,...)
while(!_done){
uint32_t idx = iwrite.load();
uint32_t inext = (idx+1)%_queue.size();
if( inext == iread.load() ) return kQUEUE_FULL;
if( iwrite.compare_exchange_weak(idx, inext) ){
_queue[idx] = jevent; // jevent is JEvent* passed into this method
uint32_t save_idx = idx;
while( !_done ){
if( iend.compare_exchange_weak(idx, inext) ) break;
idx = save_idx;
}
break;
}
}

To me, one possible issue can occurs when there are 2 writers and 1 reader. Suppose that 1st writer stops just before
_queue[0] = jevent;
and 2nd writer signals via iend that its _queue[1] is ready to be read. Then, reader via iend sees that _queue[0] is ready to be read, so we have data race.
I recommend you try Relacy Race Detector, that ideally applies to such kind of analysis.

CString use coupled with HeapWalk and HeapLock/HeapUnlock deadlocks in the kernel

My goal is to lock virtual memory allocated for my process heaps (to prevent a possibility of it being swapped out to disk.)
I use the following code:
//pseudo-code, error checks are omitted for brevity
struct MEM_PAGE_TO_LOCK{
const BYTE* pBaseAddr; //Base address of the page
size_t szcbBlockSz; //Size of the block in bytes
MEM_PAGE_TO_LOCK()
: pBaseAddr(NULL)
, szcbBlockSz(0)
{
}
};
void WorkerThread(LPVOID pVoid)
{
//Called repeatedly from a worker thread
HANDLE hHeaps[256] = {0}; //Assume large array for the sake of this example
UINT nNumberHeaps = ::GetProcessHeaps(256, hHeaps);
if(nNumberHeaps > 256)
nNumberHeaps = 256;
std::vector<MEM_PAGE_TO_LOCK> arrPages;
for(UINT i = 0; i < nNumberHeaps; i++)
{
lockUnlockHeapAndWalkIt(hHeaps[i], arrPages);
}
//Now lock collected virtual memory
for(size_t p = 0; p < arrPages.size(); p++)
{
::VirtualLock((void*)arrPages[p].pBaseAddr, arrPages[p].szcbBlockSz);
}
}
void lockUnlockHeapAndWalkIt(HANDLE hHeap, std::vector<MEM_PAGE_TO_LOCK>& arrPages)
{
if(::HeapLock(hHeap))
{
__try
{
walkHeapAndCollectVMPages(hHeap, arrPages);
}
__finally
{
::HeapUnlock(hHeap);
}
}
}
void walkHeapAndCollectVMPages(HANDLE hHeap, std::vector<MEM_PAGE_TO_LOCK>& arrPages)
{
PROCESS_HEAP_ENTRY phe = {0};
MEM_PAGE_TO_LOCK mptl;
SYSTEM_INFO si = {0};
::GetSystemInfo(&si);
for(;;)
{
//Get next heap block
if(!::HeapWalk(hHeap, &phe))
{
if(::GetLastError() != ERROR_NO_MORE_ITEMS)
{
//Some other error
ASSERT(NULL);
}
break;
}
//We need to skip heap regions & uncommitted areas
//We're interested only in allocated blocks
if((phe.wFlags & (PROCESS_HEAP_REGION |
PROCESS_HEAP_UNCOMMITTED_RANGE | PROCESS_HEAP_ENTRY_BUSY)) == PROCESS_HEAP_ENTRY_BUSY)
{
if(phe.cbData &&
phe.lpData)
{
//Get address aligned at the page size boundary
size_t nRmndr = (size_t)phe.lpData % si.dwPageSize;
BYTE* pBegin = (BYTE*)((size_t)phe.lpData - nRmndr);
//Get segment size, also page aligned (round it up though)
BYTE* pLast = (BYTE*)phe.lpData + phe.cbData;
nRmndr = (size_t)pLast % si.dwPageSize;
if(nRmndr)
pLast += si.dwPageSize - nRmndr;
size_t szcbSz = pLast - pBegin;
//Do we have such a block already, or an adjacent one?
std::vector<MEM_PAGE_TO_LOCK>::iterator itr = arrPages.begin();
for(; itr != arrPages.end(); ++itr)
{
const BYTE* pLPtr = itr->pBaseAddr + itr->szcbBlockSz;
//See if they intersect or are adjacent
if(pLPtr >= pBegin &&
itr->pBaseAddr <= pLast)
{
//Intersected with another memory block
//Get the larger of the two
if(pBegin < itr->pBaseAddr)
itr->pBaseAddr = pBegin;
itr->szcbBlockSz = pLPtr > pLast ? pLPtr - itr->pBaseAddr : pLast - itr->pBaseAddr;
break;
}
}
if(itr == arrPages.end())
{
//Add new page
mptl.pBaseAddr = pBegin;
mptl.szcbBlockSz = szcbSz;
arrPages.push_back(mptl);
}
}
}
}
}
This method works, except that rarely the following happens. The app hangs up, UI and everything, and even if I try to run it with the Visual Studio debugger and then try to Break all, it shows an error message that no user-mode threads are running:
The process appears to be deadlocked (or is not running any user-mode
code). All threads have been stopped.
I tried it several times. The second time when the app hung up, I used the Task Manager to create dump file, after which I loaded the .dmp file into Visual Studio & analyzed it. The debugger showed that the deadlock happened somewhere in the kernel:
and if you review the call stack:
It points to the location of the code as such:
CString str;
str.Format(L"Some formatting value=%d, %s", value, etc);
Experimenting further with it, if I remove HeapLock and HeapUnlock calls from the code above, it doesn't seem to hang anymore. But then HeapWalk may sometimes issue an unhandled exception, access violation.
So any suggestions how to resolve this?

The problem is that you're using the C runtime's memory management, and more specifically the CRT's debug heap, while holding the operating system's heap lock.
The call stack you've posted includes _free_dbg, which always claims the CRT debug heap lock before taking any other action, so we know the thread holds the CRT debug heap lock. We can also see that the CRT was inside an operating system call made by _CrtIsValidHeapPointer when the deadlock occurred; the only such call is to HeapValidate and HEAP_NO_SERIALIZE is not specified.
So the thread whose call stack has been posted is holding the CRT debug heap lock and attempting to claim the operating system's heap lock.
The worker thread, on the other hand, holds the operating system's heap lock and makes calls that attempt to claim the CRT debug heap lock.
QED. Classic deadlock situation.
In a debug build, you will need to refrain from using any C or C++ library functions that might allocate or free memory while you are holding the corresponding operating system heap lock.
Even in a release build, you would still need to avoid any library functions that might allocate or release memory while holding a lock, which might be a problem if, for example, a hypothetical future implementation of std::vector was changed to make it thread-safe.
I recommend that you avoid the issue entirely, which is probably best done by creating a dedicated heap for your worker thread and taking all necessary memory allocations out of that heap. It would probably be best to exclude this heap from processing; the documentation for HeapWalk does not explicitly say that you should not modify the heap during enumeration, but it seems risky.

MongoDB C driver efficiency

I'm trying to write a program whose job it is to go into shared memory, retrieve a piece of information (a struct 56 bytes in size), then parse that struct lightly and write it to a database.
The catch is that it needs to do this several dozens of thousands of times per second. I'm running this on a dedicated Ubuntu 14.04 server with dual Xeon X5677's and 32GB RAM. Also, Mongo is running PerconaFT as its storage engine. I am making an uneducated guess here, but say worst case load scenario would be 100,000 writes per second.
Shared memory is populated by another program who's reading information from a real time data stream, so I can't necessarily reproduce scenarios.
First... is Mongo the right choice for this task?
Next, this is the code that I've got right now. It starts with creating a list of collections (the list of items I want to record data points on is fixed) and then retrieving data from shared memory until it catches a signal.
int main()
{
//these deal with navigating shared memory
uint latestNotice=0, latestTurn=0, latestPQ=0, latestPQturn=0;
symbol_data *notice = nullptr;
bool done = false;
//this is our 56 byte struct
pq item;
uint64_t today_at_midnight; //since epoch, in milliseconds
{
time_t seconds = time(NULL);
today_at_midnight = seconds/(60*60*24);
today_at_midnight *= (60*60*24*1000);
}
//connect to shared memory
infob=info_block_init();
uint32_t used_symbols = infob->used_symbols;
getPosition(latestNotice, latestTurn);
//fire up mongo
mongoc_client_t *client = nullptr;
mongoc_collection_t *collections[used_symbols];
mongoc_collection_t *collection = nullptr;
bson_error_t error;
bson_t *doc = nullptr;
mongoc_init();
client = mongoc_client_new("mongodb://localhost:27017/");
for(uint32_t symbol = 0; symbol < used_symbols; symbol++)
{
collections[symbol] = mongoc_client_get_collection(client, "scribe",
(infob->sd+symbol)->getSymbol());
}
//this will be used later to sleep one millisecond
struct timespec ts;
ts.tv_sec=0;
ts.tv_nsec=1000000;
while(continue_running) //becomes false if a signal is caught
{
//check that new info is available in shared memory
//sleep 1ms if it isn't
while(!getNextNotice(&notice,latestNotice,latestTurn)) nanosleep(&ts, NULL);
//get the new info
done=notice->getNextItem(item, latestPQ, latestPQturn);
if(done) continue;
//just some simple array math to make sure we're on the right collection
collection = collections[notice - infob->sd];
//switch on the item type and parse it accordingly
switch(item.tp)
{
case pq::pq_event:
doc = BCON_NEW(
//decided to use this instead of std::chrono
"ts", BCON_DATE_TIME(today_at_midnight + item.ts),
//item.pr is a uint64_t, and the guidance I've read on mongo
//advises using strings for those values
"pr", BCON_UTF8(std::to_string(item.pr).c_str()),
"sz", BCON_INT32(item.sz),
"vn", BCON_UTF8(venue_labels[item.vn]),
"tp", BCON_UTF8("e")
);
if(!mongoc_collection_insert(collection, MONGOC_INSERT_NONE, doc, NULL, &error))
{
LOG(1,"Mongo Error: "<<error.message<<endl);
}
break;
//obviously, several other cases go here, but they all look the
//same, using BCON macros for their data.
default:
LOG(1,"got unknown type = "<<item.tp<<endl);
break;
}
}
//clean up once we break from the while()
if(doc != nullptr) bson_destroy(doc);
for(uint32_t symbol = 0; symbol < used_symbols; symbol++)
{
collection = collections[symbol];
mongoc_collection_destroy(collection);
}
if(client != nullptr) mongoc_client_destroy(client);
mongoc_cleanup();
return 0;
}
My second question is: is this the fastest way to do this? The retrieval from shared memory isn't perfect, but this program is getting way behind its supply of data, far moreso than I need it to be. So I'm looking for obvious mistakes with regards to efficiency or technique when speed is the goal.
Thanks in advance. =)

What could cause a mutex to misbehave?

I've been busy the last couple of months debugging a rare crash caused somewhere within a very large proprietary C++ image processing library, compiled with GCC 4.7.2 for an ARM Cortex-A9 Linux target. Since a common symptom was glibc complaining about heap corruption, the first step was to employ a heap corruption checker to catch oob memory writes. I used the technique described in https://stackoverflow.com/a/17850402/3779334 to divert all calls to free/malloc to my own function, padding every allocated chunk of memory with some amount of known data to catch out-of-bounds writes - but found nothing, even when padding with as much as 1 KB before and after every single allocated block (there are hundreds of thousands of allocated blocks due to intensive use of STL containers, so I can't enlarge the padding further, plus I assume any write more than 1KB out of bounds would eventually trigger a segfault anyway). This bounds checker has found other problems in the past so I don't doubt its functionality.
(Before anyone says 'Valgrind', yes, I have tried that too with no results either.)
Now, my memory bounds checker also has a feature where it prepends every allocated block with a data struct. These structs are all linked in one long linked list, to allow me to occasionally go over all allocations and test memory integrity. For some reason, even though all manipulations of this list are mutex protected, the list was getting corrupted. When investigating the issue, it began to seem like the mutex itself was occasionally failing to do its job. Here is the pseudocode:
pthread_mutex_t alloc_mutex;
static bool boolmutex; // set to false during init. volatile has no effect.
void malloc_wrapper() {
// ...
pthread_mutex_lock(&alloc_mutex);
if (boolmutex) {
printf("mutex misbehaving\n");
__THROW_ERROR__; // this happens!
}
boolmutex = true;
// manipulate linked list here
boolmutex = false;
pthread_mutex_unlock(&alloc_mutex);
// ...
}
The code commented with "this happens!" is occasionally reached, even though this seems impossible. My first theory was that the mutex data structure was being overwritten. I placed the mutex within a struct, with large arrays before and after it, but when this problem occurred the arrays were untouched so nothing seems to be overwritten.
So.. What kind of corruption could possibly cause this to happen, and how would I find and fix the cause?
A few more notes. The test program uses 3-4 threads for processing. Running with less threads seems to make the corruptions less common, but not disappear. The test runs for about 20 seconds each time and completes successfully in the vast majority of cases (I can have 10 units repeating the test, with the first failure occurring after 5 minutes to several hours). When the problem occurs it is quite late in the test (say, 15 seconds in), so this isn't a bad initialization issue. The memory bounds checker never catches actual out of bounds writes but glibc still occasionally fails with a corrupted heap error (Can such an error be caused by something other than an oob write?). Each failure generates a core dump with plenty of trace information; there is no pattern I can see in these dumps, no particular section of code that shows up more than others. This problem seems very specific to a particular family of algorithms and does not happen in other algorithms, so I'm quite certain this isn't a sporadic hardware or memory error. I have done many more tests to check for oob heap accesses which I don't want to list to keep this post from getting any longer.
Thanks in advance for any help!

Thanks to all commenters. I've tried nearly all suggestions with no results, when I finally decided to write a simple memory allocation stress test - one that would run a thread on each of the CPU cores (my unit is a Freescale i.MX6 quad core SoC), each allocating and freeing memory in random order at high speed. The test crashed with a glibc memory corruption error within minutes or a few hours at most.
Updating the kernel from 3.0.35 to 3.0.101 solved the problem; both the stress test and the image processing algorithm now run overnight without failing. The problem does not reproduce on Intel machines with the same kernel version, so the problem is specific either to ARM in general or perhaps to some patch Freescale included with the specific BSP version that included kernel 3.0.35.
For those curious, attached is the stress test source code. Set NUM_THREADS to the number of CPU cores and build with:
<cross-compiler-prefix>g++ -O3 test_heap.cpp -lpthread -o test_heap
I hope this information helps someone. Cheers :)
// Multithreaded heap stress test. By Itay Chamiel 20151012.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>
#define NUM_THREADS 4 // set to number of CPU cores
#define ALIVE_INDICATOR NUM_THREADS
// Each thread constantly allocates and frees memory. In each iteration of the infinite loop, decide at random whether to
// allocate or free a block of memory. A list of 500-1000 allocated blocks is maintained by each thread. When memory is allocated
// it is added to this list; when freeing, a random block is selected from this list, freed and removed from the list.
void* thr(void* arg) {
int* alive_flag = (int*)arg;
int thread_id = *alive_flag; // this is a number between 0 and (NUM_THREADS-1) given by main()
int cnt = 0;
timeval t_pre, t_post;
gettimeofday(&t_pre, NULL);
const int ALLOCATE=1, FREE=0;
const unsigned int MINSIZE=500, MAXSIZE=1000;
const int MAX_ALLOC=10000;
char* membufs[MAXSIZE];
unsigned int membufs_size = 0;
int num_allocs = 0, num_frees = 0;
while(1)
{
int action;
// Decide whether to allocate or free a memory block.
// if we have less than MINSIZE buffers, allocate.
if (membufs_size < MINSIZE) action = ALLOCATE;
// if we have MAXSIZE, free.
else if (membufs_size >= MAXSIZE) action = FREE;
// else, decide randomly.
else {
action = ((rand() & 0x1)? ALLOCATE : FREE);
}
if (action == ALLOCATE) {
// choose size to allocate, from 1 to MAX_ALLOC bytes
size_t size = (rand() % MAX_ALLOC) + 1;
// allocate and fill memory
char* buf = (char*)malloc(size);
memset(buf, 0x77, size);
// add buffer to list
membufs[membufs_size] = buf;
membufs_size++;
assert(membufs_size <= MAXSIZE);
num_allocs++;
}
else { // action == FREE
// choose a random buffer to free
size_t pos = rand() % membufs_size;
assert (pos < membufs_size);
// free and remove from list by replacing entry with last member
free(membufs[pos]);
membufs[pos] = membufs[membufs_size-1];
membufs_size--;
assert(membufs_size >= 0);
num_frees++;
}
// once in 10 seconds print a status update
gettimeofday(&t_post, NULL);
if (t_post.tv_sec - t_pre.tv_sec >= 10) {
printf("Thread %d [%d] - %d allocs %d frees. Alloced blocks %u.\n", thread_id, cnt++, num_allocs, num_frees, membufs_size);
gettimeofday(&t_pre, NULL);
}
// indicate alive to main thread
*alive_flag = ALIVE_INDICATOR;
}
return NULL;
}
int main()
{
int alive_flag[NUM_THREADS];
printf("Memory allocation stress test running on %d threads.\n", NUM_THREADS);
// start a thread for each core
for (int i=0; i<NUM_THREADS; i++) {
alive_flag[i] = i; // tell each thread its ID.
pthread_t th;
int ret = pthread_create(&th, NULL, thr, &alive_flag[i]);
assert(ret == 0);
}
while(1) {
sleep(10);
// check that all threads are alive
bool ok = true;
for (int i=0; i<NUM_THREADS; i++) {
if (alive_flag[i] != ALIVE_INDICATOR)
{
printf("Thread %d is not responding\n", i);
ok = false;
}
}
assert(ok);
for (int i=0; i<NUM_THREADS; i++)
alive_flag[i] = 0;
}
return 0;
}

How to find a memory leak in C++

What would be a good way to detect a C++ memory leak in an embedded environment? I tried overloading the new operator to log every data allocation, but I must have done something wrong, that approach isn't working. Has anyone else run into a similar situation?
This is the code for the new and delete operator overloading.
EDIT:
Full disclosure: I am looking for a memory leak in my program and I am using this code that someone else wrote to overload the new and delete operator. Part of my problem is the fact that I don't fully understand what it does. I know that the goal is to log the address of the caller and previous caller, the size of the allocation, a 1 if we are allocating, a 2 if we are deallocation. plus the name of the thread that is running.
Thanks for all the suggestions, I am going to try a different approach that someone here at work suggested. If it works, I will post it here.
Thanks again to all you first-rate programmers for taking the time to answer.
StackOverflow rocks!
Conclusion
Thanks for all the answers. Unfortunately, I had to move on to a different more pressing issue. This leak only occurred under a highly unlikely scenario. I feel crappy about just dropping it, I may go back to it if I have more time. I chose the answer I am most likely to use.
#include <stdlib.h>
#include "stdio.h"
#include "nucleus.h"
#include "plus/inc/dm_defs.h"
#include "plus/inc/pm_defs.h"
#include "posix\inc\posix.h"
extern void* TCD_Current_Thread;
extern "C" void rd_write_text(char * text);
extern PM_PCB * PMD_Created_Pools_List;
typedef struct {
void* addr;
uint16_t size;
uint16_t flags;
} MemLogEntryNarrow_t;
typedef struct {
void* addr;
uint16_t size;
uint16_t flags;
void* caller;
void* prev_caller;
void* taskid;
uint32_t timestamp;
} MemLogEntryWide_t;
//size lookup table
unsigned char MEM_bitLookupTable[] = {
0,1,1,2,1,2,2,3,1,2,2,3,1,3,3,4
};
//#pragma CODE_SECTION ("section_ramset1_0")
void *::operator new(unsigned int size)
{
asm(" STR R14, [R13, #0xC]"); //save stack address temp[0]
asm(" STR R13, [R13, #0x10]"); //save pc return address temp[1]
if ( loggingEnabled )
{
uint32_t savedInterruptState;
uint32_t currentIndex;
// protect the thread unsafe section.
savedInterruptState = NU_Local_Control_Interrupts(NU_DISABLE_INTERRUPTS);
// Note that this code is FRAGILE. It peeks backwards on the stack to find the return
// address of the caller. The location of the return address on the stack can be easily changed
// as a result of other changes in this function (i.e. adding local variables, etc).
// The offsets may need to be adjusted if this function is touched.
volatile unsigned int temp[2];
unsigned int *addr = (unsigned int *)temp[0] - 1;
unsigned int count = 1 + (0x20/4); //current stack space ***
//Scan for previous store
while ((*addr & 0xFFFF0000) != 0xE92D0000)
{
if ((*addr & 0xFFFFF000) == 0xE24DD000)
{
//add offset in words
count += ((*addr & 0xFFF) >> 2);
}
addr--;
}
count += MEM_bitLookupTable[*addr & 0xF];
count += MEM_bitLookupTable[(*addr >>4) & 0xF];
count += MEM_bitLookupTable[(*addr >> 8) & 0xF];
count += MEM_bitLookupTable[(*addr >> 12) & 0xF];
addr = (unsigned int *)temp[1] + count;
// FRAGILE CODE ENDS HERE
currentIndex = currentMemLogWriteIndex;
currentMemLogWriteIndex++;
if ( memLogNarrow )
{
if (currentMemLogWriteIndex >= MEMLOG_SIZE/2 )
{
loggingEnabled = false;
rd_write_text( "Allocation Logging is complete and DISABLED!\r\n\r\n");
}
// advance the read index if necessary.
if ( currentMemLogReadIndex == currentMemLogWriteIndex )
{
currentMemLogReadIndex++;
if ( currentMemLogReadIndex == MEMLOG_SIZE/2 )
{
currentMemLogReadIndex = 0;
}
}
NU_Local_Control_Interrupts(savedInterruptState);
//Standard operator
//(For Partition Analysis we have to consider that if we alloc size of 0 always as size of 1 then are partitions must be optimized for this)
if (size == 0) size = 1;
((MemLogEntryNarrow_t*)memLog)[currentIndex].size = size;
((MemLogEntryNarrow_t*)memLog)[currentIndex].flags = 1; //allocated
//Standard operator
void * ptr;
ptr = malloc(size);
((MemLogEntryNarrow_t*)memLog)[currentIndex].addr = ptr;
return ptr;
}
else
{
if (currentMemLogWriteIndex >= MEMLOG_SIZE/6 )
{
loggingEnabled = false;
rd_write_text( "Allocation Logging is complete and DISABLED!\r\n\r\n");
}
// advance the read index if necessary.
if ( currentMemLogReadIndex == currentMemLogWriteIndex )
{
currentMemLogReadIndex++;
if ( currentMemLogReadIndex == MEMLOG_SIZE/6 )
{
currentMemLogReadIndex = 0;
}
}
((MemLogEntryWide_t*)memLog)[currentIndex].caller = (void *)(temp[0] - 4);
((MemLogEntryWide_t*)memLog)[currentIndex].prev_caller = (void *)*addr;
NU_Local_Control_Interrupts(savedInterruptState);
((MemLogEntryWide_t*)memLog)[currentIndex].taskid = (void *)TCD_Current_Thread;
((MemLogEntryWide_t*)memLog)[currentIndex].size = size;
((MemLogEntryWide_t*)memLog)[currentIndex].flags = 1; //allocated
((MemLogEntryWide_t*)memLog)[currentIndex].timestamp = *(volatile uint32_t *)0xfffbc410; // for arm9
//Standard operator
if (size == 0) size = 1;
void * ptr;
ptr = malloc(size);
((MemLogEntryWide_t*)memLog)[currentIndex].addr = ptr;
return ptr;
}
}
else
{
//Standard operator
if (size == 0) size = 1;
void * ptr;
ptr = malloc(size);
return ptr;
}
}
//#pragma CODE_SECTION ("section_ramset1_0")
void ::operator delete(void *ptr)
{
uint32_t savedInterruptState;
uint32_t currentIndex;
asm(" STR R14, [R13, #0xC]"); //save stack address temp[0]
asm(" STR R13, [R13, #0x10]"); //save pc return address temp[1]
if ( loggingEnabled )
{
savedInterruptState = NU_Local_Control_Interrupts(NU_DISABLE_INTERRUPTS);
// Note that this code is FRAGILE. It peeks backwards on the stack to find the return
// address of the caller. The location of the return address on the stack can be easily changed
// as a result of other changes in this function (i.e. adding local variables, etc).
// The offsets may need to be adjusted if this function is touched.
volatile unsigned int temp[2];
unsigned int *addr = (unsigned int *)temp[0] - 1;
unsigned int count = 1 + (0x20/4); //current stack space ***
//Scan for previous store
while ((*addr & 0xFFFF0000) != 0xE92D0000)
{
if ((*addr & 0xFFFFF000) == 0xE24DD000)
{
//add offset in words
count += ((*addr & 0xFFF) >> 2);
}
addr--;
}
count += MEM_bitLookupTable[*addr & 0xF];
count += MEM_bitLookupTable[(*addr >>4) & 0xF];
count += MEM_bitLookupTable[(*addr >> 8) & 0xF];
count += MEM_bitLookupTable[(*addr >> 12) & 0xF];
addr = (unsigned int *)temp[1] + count;
// FRAGILE CODE ENDS HERE
currentIndex = currentMemLogWriteIndex;
currentMemLogWriteIndex++;
if ( memLogNarrow )
{
if ( currentMemLogWriteIndex >= MEMLOG_SIZE/2 )
{
loggingEnabled = false;
rd_write_text( "Allocation Logging is complete and DISABLED!\r\n\r\n");
}
// advance the read index if necessary.
if ( currentMemLogReadIndex == currentMemLogWriteIndex )
{
currentMemLogReadIndex++;
if ( currentMemLogReadIndex == MEMLOG_SIZE/2 )
{
currentMemLogReadIndex = 0;
}
}
NU_Local_Control_Interrupts(savedInterruptState);
// finish logging the fields. these are thread safe so they dont need to be inside the protected section.
((MemLogEntryNarrow_t*)memLog)[currentIndex].addr = ptr;
((MemLogEntryNarrow_t*)memLog)[currentIndex].size = 0;
((MemLogEntryNarrow_t*)memLog)[currentIndex].flags = 2; //unallocated
}
else
{
((MemLogEntryWide_t*)memLog)[currentIndex].caller = (void *)(temp[0] - 4);
((MemLogEntryWide_t*)memLog)[currentIndex].prev_caller = (void *)*addr;
if ( currentMemLogWriteIndex >= MEMLOG_SIZE/6 )
{
loggingEnabled = false;
rd_write_text( "Allocation Logging is complete and DISABLED!\r\n\r\n");
}
// advance the read index if necessary.
if ( currentMemLogReadIndex == currentMemLogWriteIndex )
{
currentMemLogReadIndex++;
if ( currentMemLogReadIndex == MEMLOG_SIZE/6 )
{
currentMemLogReadIndex = 0;
}
}
NU_Local_Control_Interrupts(savedInterruptState);
// finish logging the fields. these are thread safe so they dont need to be inside the protected section.
((MemLogEntryWide_t*)memLog)[currentIndex].addr = ptr;
((MemLogEntryWide_t*)memLog)[currentIndex].size = 0;
((MemLogEntryWide_t*)memLog)[currentIndex].flags = 2; //unallocated
((MemLogEntryWide_t*)memLog)[currentIndex].taskid = (void *)TCD_Current_Thread;
((MemLogEntryWide_t*)memLog)[currentIndex].timestamp = *(volatile uint32_t *)0xfffbc410; // for arm9
}
//Standard operator
if (ptr != NULL) {
free(ptr);
}
}
else
{
//Standard operator
if (ptr != NULL) {
free(ptr);
}
}
}

If you're running Linux, I suggest trying Valgrind.

There are several forms of operator new:
void *operator new (size_t);
void *operator new [] (size_t);
void *operator new (size_t, void *);
void *operator new [] (size_t, void *);
void *operator new (size_t, /* parameters of your choosing! */);
void *operator new [] (size_t, /* parameters of your choosing! */);
All the above can exist at both global and class scope. For each operator new, there is an equivalent operator delete. You need to make sure you are adding logging to all versions of the operator, if that is the way you want to do it.
Ideally, you would want the system to behave the same regardless of whether the memory logging is present or not. For example, the MS VC run time library allocates more memory in debug than in release because it prefixes the memory allocation with a bigger information block and adds guard blocks to the start and end of the allocation. The best solution is to keep all the memory logging information is a separate chunk or memory and use a map to track the memory. This can also be used to verify that the memory passed to delete is valid.
new
allocate memory
add entry to logging table
delete
check address exists in logging table
free memory
However, you're writing embedded software where, usually, memory is a limited resource. It is usually preferable on these systems to avoid dynamic memory allocation for several reasons:
You know how much memory there is so you know in advance how many objects you can allocate. Allocation should never return null as that is usually terminal with no easy way of getting back to a healthy system.
Allocating and freeing memory leads to fragmentation. The number of objects you can allocate will decrease over time. You could write a memory compactor to move allocated objects around to free up bigger chunks of memory but that will affect performance. As in point 1, once you get a null, things get tricky.
So, when doing embedded work, you usually know up front how much memory can be allocated to various objects and, knowing this, you can write more efficient memory managers for each object type that can take appropriate action when memory runs out - discarding old items, crashing, etc.
EDIT
If you want to know what called the memory allocation, the best thing to do is use a macro (I know, macros are generally bad):
#define NEW new (__FILE__, __LINE__, __FUNCTION__)
and define an operator new:
void *operator new (size_t size, char *file, int line, char *function)
{
// log the allocation somewhere, no need to strcpy file or function, just save the
// pointer values
return malloc (size);
}
and use it like this:
SomeObject *obj = NEW SomeObject (parameters);
You compiler might not have the __FUNCTION__ preprocessor definition so you can safely omit it.

http://www.linuxjournal.com/article/6059
Actually from my experience it always better to create memory pools for embedded systems and use custom allocator/de-allocator. We can easily identify the leaks. For example, we had a simple custom memory manager for vxworks, where we store the task id, timestamp in the allocated mem block.

One way is to insert file name and line number strings (via pointer) of the module allocating memory into the allocated block of data. The file and line number is handled by using the C++ standard "__FILE__" and "__LINE__" macros. When the memory is de-allocated, that information is removed.
One of our systems has this feature and we call it a "memory hog report". So anytime from our CLI we can print out all the allocated memory along with a big list of information of who has allocated memory. This list is sorted by which code module has the most memory allocated. Many times we'll monitor memory usage this way over time, and eventually the memory hog (leak) will bubble up to the top of the list.

overloading new and delete should work if you pay close attention.
Maybe you can show us what isn't working about that approach?

I'm not an embedded environment expert, so the only advise I can give is to test as much code as you can on your development machine using your favorite free or proprietary tools. Tools for a particular embedded platform may also exist and you can use them for final testing. But most powerful tools are for desktops.
On desktop environment I like the job DevPartner Studio does. This is for Windows and proprietary. There're free tools available for Linux but I don't have much expirience with them. As an example there's EFence

If you override constructor and destructor of your classes, you can print to the screen or a log file. Doing this will give you an idea of when things are being created, what is being created, as well as the same information for deletion.
For easy browsing, you can add a temporary global variable, "INSTANCE_ID", and print this (and increment) on every constructor/destructor call. Then you can browse by ID, and it should make goings a little easier.

The way we did it with our C 3D toolkit was to create custom new/malloc and delete macros that logged each allocation and deallocation to a file. We had to ensure that all the code called our macros of course. The writing to the log file was controlled by a run time flag and only happened under debug so we didn't have to recompile.
Once the run was complete a post-processor ran over the file matching allocations to deallocations and reported any unmatched allocations.
It had a performance hit, but we only needed to do it once it a while.

Is there a real need to roll your own memory leak detection?
Assuming you can't use dynamic memory checkers, like the open-source valgrind tool on Linux, static analysis tools like the commercial products Coverity Prevent and Klocwork Insight may be of use. I've used all three, and have had very good results with all of them.

Lots of good answers.
I would just point out that if the program is one that, like a small command-line utility, runs for a short period of time and then releases all its memory back to the OS, memory leaks probably do no harm.

You can use a third party tool to do this.
You can detect leaks within your own class structures by adding memory counters in your New and Delete calls to increment/decrement the memory counters, and print out a report at your application close. However, this won't detect memory leaks for memory allocated outside your class system - a third party tool can do this though.

Can you describe what is "not working" with your log methods?
Do you not get the expected logs? or, are they showing everything is fine but you still have leaks?
How have you confirmed that this is definitely a leak and not some other type of corruption?
One way to check your overloading is correct: Instantiate a counter object per class, increment this in the new and decrement it in the delete of the class. If you have a growing count, you have a leak. Now, you would expect your log lines to be coinciding with the increment and decrement points.

Not specifically for embedded development, but we used to use BoundsChecker for that.

Use smart pointers and never think about it again, there's loads of official types around but are pretty easy to roll your own too:
class SmartCoMem
{
public:
SmartCoMem() : m_size( 0 ), m_ptr64( 0 ) {
}
~SmartCoMem() {
if( m_size )
CoTaskMemFree((LPVOID)m_ptr64);
}
void copyin( LPCTSTR in, const unsigned short size )
{
LPVOID ptr;
ptr = CoTaskMemRealloc( (LPVOID)m_ptr64, size );
if( ptr == NULL )
throw std::exception( "SmartCoMem: CoTaskMemAlloc Failed" );
else
{
m_size = size;
m_ptr64 = (__int64)ptr;
memcpy( (LPVOID)m_ptr64, in, size );
}
}
std::string copyout( ) {
std::string out( (LPCSTR)m_ptr64, m_size );
return out;
}
__int64* ptr() {
return &m_ptr64;
}
unsigned short size() {
return m_size;
}
unsigned short* sizePtr() {
return &m_size;
}
bool loaded() {
return m_size > 0;
}
private:
//don't allow copying as this is a wrapper around raw memory
SmartCoMem (const SmartCoMem &);
SmartCoMem & operator = (const SmartCoMem &);
__int64 m_ptr64;
unsigned short m_size;
};
There's no encapsulation in this example due to the API I was working with but still better than working with completely raw pointers.

For testing like this, try compiling your embedded code natively for Linux (or whatever OS you use), and use the a well-established tool like Valgrind to test for memory leaks. It can be a challenge to do this, but you just need to replace any code that directly accesses hardware with some code that simulates something suitable for your testing.
I've found that using SWIG to convert my embedded code into a linux-native library and running the code from a Python script is really effective. You can use all of the tools that are available for non-embedded projects, and test all of your code except the hardware drivers.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js