Pointer passed to function changes unexpectedly - c++

I'm designing a preloader-based lock tracing utility that attaches to Pthreads, and I've run into a weird issue. The program works by providing wrappers that replace relevant Pthreads functions at runtime; these do some logging, and then pass the args to the real Pthreads function to do the work. They do not modify the arguments passed to them, obviously. However, when testing, I discovered that the condition variable pointer passed to my pthread_cond_wait() wrapper does not match the one that gets passed to the underlying Pthreads function, which promptly crashes with "futex facility returned an unexpected error code," which, from what I've gathered, usually indicates an invalid sync object passed in. Relevant stack trace from GDB:
#8 __pthread_cond_wait (cond=0x7f1b14000d12, mutex=0x55a2b961eec0) at pthread_cond_wait.c:638
#9 0x00007f1b1a47b6ae in pthread_cond_wait (cond=0x55a2b961f290, lk=0x55a2b961eec0)
at pthread_trace.cpp:56
I'm pretty mystified. Here's the code for my pthread_cond_wait() wrapper:
int pthread_cond_wait(pthread_cond_t* cond, pthread_mutex_t* lk) {
    // log arrival at wait
    the_tracer.add_event(lktrace::event::COND_WAIT, (size_t) cond);
    // run pthreads function
    GET_REAL_FN(pthread_cond_wait, int, pthread_cond_t*, pthread_mutex_t*);
    int e = REAL_FN(cond, lk);
    if (e == 0) the_tracer.add_event(lktrace::event::COND_LEAVE, (size_t) cond);
    else {
        the_tracer.add_event(lktrace::event::COND_ERR, (size_t) cond);
    }
    return e;
}
// GET_REAL_FN is defined as:
#define GET_REAL_FN(name, rtn, params...) \
    typedef rtn (*real_fn_t)(params); \
    static const real_fn_t REAL_FN = (real_fn_t) dlsym(RTLD_NEXT, #name); \
    assert(REAL_FN != NULL) // semicolon absence intentional
And here's the code for __pthread_cond_wait in glibc 2.31 (this is the function that gets called if you call pthread_cond_wait normally; it has a different name because of symbol versioning. The stack trace above confirms that this is the function REAL_FN points to):
int
__pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
{
  /* clockid is unused when abstime is NULL. */
  return __pthread_cond_wait_common (cond, mutex, 0, NULL);
}
As you can see, neither of these functions modifies cond, yet it is not the same in the two frames. Examining the two different pointers in a core dump shows that they point to different contents, as well. I can also see in the core dump that cond does not appear to change in my wrapper function (i.e. it's still equal to 0x5... in frame 9 at the crash point, which is the call to REAL_FN). I can't really tell which pointer is correct by looking at their contents, but I'd assume it's the one passed in to my wrapper from the target application. Both pointers point to valid segments for program data (marked ALLOC, LOAD, HAS_CONTENTS).
My tool is definitely causing the error somehow; the target application runs fine when it is not attached. What am I missing?
UPDATE: Actually, this doesn't appear to be what's causing the error, because calls to my pthread_cond_wait() wrapper succeed many times before the error occurs, and exhibit similar behavior (pointer value changing between frames without explanation) each time. I'm leaving the question open, though, because I still don't understand what's going on here and I'd like to learn.
UPDATE 2: As requested, here's the code for tracer.add_event():
// add an event to the calling thread's history
// hist_entry ctor gets timestamp & stack trace
void tracer::add_event(event e, size_t obj_addr) {
    size_t tid = get_tid();
    hist_map::iterator hist = histories.contains(tid);
    assert(hist != histories.end());
    hist_entry ev (e, obj_addr);
    hist->second.push_back(ev);
}
// hist_entry ctor:
hist_entry::hist_entry(event e, size_t obj_addr) :
    ts(chrono::steady_clock::now()), ev(e), addr(obj_addr) {

    // these are set in the tracer ctor
    assert(start_addr && end_addr);
    void* buf[TRACE_DEPTH];
    int v = backtrace(buf, TRACE_DEPTH);
    int a = 0;
    // find first frame outside of our own code
    while (a < v && start_addr < (size_t) buf[a] &&
           end_addr > (size_t) buf[a]) ++a;
    // skip requested amount of frames
    a += TRACE_SKIP;
    if (a >= v) a = v-1;
    caller = buf[a];
}
histories is a lock-free concurrent hashmap from libcds (mapping tid -> per-thread vectors of hist_entry), and its iterators are guaranteed to be thread-safe as well. The GNU docs say backtrace() is thread-safe, and no data races are mentioned in the cppreference docs for steady_clock::now(). get_tid() just calls pthread_self() using the same method as the wrapper functions and casts the result to size_t.

Hah, figured it out! The issue is that Glibc exposes multiple versions of pthread_cond_wait(), for backwards compatibility. The version I reproduce in my question is the current version, the one we want to call. The version that dlsym() was finding is the backwards-compatible version:
int
__pthread_cond_wait_2_0 (pthread_cond_2_0_t *cond, pthread_mutex_t *mutex)
{
  if (cond->cond == NULL)
    {
      pthread_cond_t *newcond;

      newcond = (pthread_cond_t *) calloc (sizeof (pthread_cond_t), 1);
      if (newcond == NULL)
        return ENOMEM;

      if (atomic_compare_and_exchange_bool_acq (&cond->cond, newcond, NULL))
        /* Somebody else just initialized the condvar. */
        free (newcond);
    }

  return __pthread_cond_wait (cond->cond, mutex);
}
As you can see, this version tail-calls the current one, which is probably why this took so long to detect: GDB is normally pretty good at detecting frames elided by tail calls, but I'm guessing it didn't detect this one because the functions have the "same" name (and the error doesn't affect the mutex functions because they don't expose multiple versions). This blog post goes into much more detail, coincidentally specifically about pthread_cond_wait(). I stepped through this function many times while debugging and sort of tuned it out, because every call into glibc is wrapped in multiple layers of indirection; I only realized what was going on when I set a breakpoint on the pthread_cond_wait symbol, instead of a line number, and it stopped at this function.
Anyway, this explains the changing pointer phenomenon: what happens is that the old, incorrect function gets called, reinterprets the pthread_cond_t object as a struct containing a pointer to a pthread_cond_t object, allocates a new pthread_cond_t for that pointer, and then passes the newly allocated one to the new, correct function. The frame of the old function gets elided by the tail-call, and to a GDB backtrace after leaving the old function it looks like the correct function gets called directly from my wrapper, with a mysteriously changed argument.
The fix for this was simple: GNU provides the libdl extension dlvsym(), which is like dlsym() but also takes a version string. Looking up pthread_cond_wait with version string "GLIBC_2.3.2" solves the problem. Note that these version strings do not usually correspond to the current glibc version (e.g. pthread_create()/pthread_exit() have version string "GLIBC_2.2.5"), so they need to be looked up on a per-function basis. The correct string can be determined either by looking at the compat_symbol() or versioned_symbol() macros near the function definition in the glibc source, or by using readelf to see the names of the symbols in the compiled library (mine has "pthread_cond_wait@@GLIBC_2.3.2" and "pthread_cond_wait@GLIBC_2.2.5").
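For reference, a minimal sketch of that versioned lookup (the "GLIBC_2.3.2" string is the one discussed above and has to be checked per function and per architecture, e.g. with readelf):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE // dlvsym() and RTLD_NEXT are GNU extensions
#endif
#include <dlfcn.h>
#include <pthread.h>
#include <cassert>

typedef int (*cond_wait_fn_t)(pthread_cond_t*, pthread_mutex_t*);

static cond_wait_fn_t get_real_cond_wait()
{
    // Like dlsym(), but also selects a specific symbol version.
    void* sym = dlvsym(RTLD_NEXT, "pthread_cond_wait", "GLIBC_2.3.2");
    assert(sym != NULL);
    return (cond_wait_fn_t) sym;
}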

Related

Proper compiler intrinsics for double-checked locking?

When implementing double-checked locking for initialization, what is the proper way to do the memory and/or compiler barriers?
Something like std::call_once isn't what I want; it's way too slow. It's typically just implemented on top of pthread_mutex_lock or EnterCriticalSection, depending on the OS.
In my programs, I often run into initialization cases where the initialization is safe to repeat, as long as exactly one thread gets to set the final pointer. If another thread beats it to setting the final pointer to the singleton object, it deletes what it created and makes use of the other thread's. I also often use this in cases where it doesn't matter which thread "wins" because they all come up with the same result.
Here's an unsafe, overly-contrived example, using Visual C++ intrinsics:
MyClass *GetGlobalMyClass()
{
    static MyClass *const UNSET_POINTER = reinterpret_cast<MyClass *>(
        static_cast<intptr_t>(-1));
    static MyClass *volatile s_object = UNSET_POINTER;
    if (s_object == UNSET_POINTER)
    {
        MyClass *newObject = MyClass::Create();
        if (_InterlockedCompareExchangePointer(&s_object, newObject,
                                               UNSET_POINTER) != UNSET_POINTER)
        {
            // Another thread beat us. If Create didn't return null, destroy.
            if (newObject)
            {
                newObject->Destroy(); // calls "delete this;", presumably
            }
        }
    }
    return s_object;
}
On a weakly-ordered memory architecture, my understanding is that it's possible that the new value of s_object is visible to other threads before other variables written inside MyClass::Create or MyClass::MyClass are visible. Also, the compiler itself could arrange the code this way in the absence of a compiler barrier (in Visual C++, _WriteBarrier, but _InterlockedCompareExchange acts as a barrier).
Do I need a store fence intrinsic function in there, or something similar, in order to ensure that MyClass's variables are visible to all threads before s_object becomes something besides -1?
Fortunately, the rules in C++ are very simple:
If there is a data race, the behaviour is undefined.
In your code the data race is caused by the following read, which conflicts with the write operation in _InterlockedCompareExchangePointer:
if (s_object == UNSET_POINTER)
A thread-safe solution without blocking might look as follows. Note that on x86 a load operation with sequential consistency has basically no overhead compared to a regular load operation. If you care about other architectures, you can also use acquire/release ordering instead of sequential consistency.
static std::atomic<MyClass*> s_object{nullptr};

MyClass* o = s_object.load(std::memory_order_seq_cst);
if (o == nullptr) {
    o = new MyClass{...};
    MyClass* expected = nullptr;
    if (!s_object.compare_exchange_strong(expected, o, std::memory_order_seq_cst)) {
        delete o;
        o = expected;
    }
}
return o;
With a conforming C++11 implementation, any function-local static variable is constructed in a thread-safe fashion by the first thread that passes through its declaration.
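To illustrate that last point with the names from the question (MyClass and MyClass::Create() are assumed to be as declared above), a minimal sketch relying on the C++11 guarantee; the trade-off is that threads arriving during the first call block until the initialization finishes:

MyClass* GetGlobalMyClass()
{
    // The first thread to reach this line runs Create(); the runtime
    // guarantees the initialization happens exactly once, thread-safely.
    static MyClass* const s_object = MyClass::Create();
    return s_object;
}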

Calling a function when thread is exiting in PThreads or Windows

I am creating a C++ library for both Linux (with PThreads) and Windows (with their built-in WinThreads) which can be attached to any program, and needs to have a function called when the thread is exiting, similar to how atexit works for processes.
I know of pthread_cleanup_push and pthread_cleanup_pop for pthreads, but these do not work for me since they are macros that add another lexical scope, whereas I want to declare this function the first time my library is called into, and then allow the program itself to run its own code however it needs to. I haven't found anything similar in Windows whatsoever.
Note that this doesn't mean I want an outside thread to be alerted when the thread stops, or even that I can change the way the thread will be exited, since that is controlled by the program itself; my library is just attached, along for the ride.
So the question is: What is the best way, in this instance, for me to have a function I've written called when the thread closes, in either Windows or Linux, when I have no control over how the thread is created or destroyed?
For example in main program:
void* threadFunc(void* arg)
{
    printf("Hello world!\n");
    return NULL;
}

int main(int argc, char** argv)
{
    int numThreads = 1;
    int nVal = 0;
    pthread_t ntid;
    pthread_t* pthreads = (pthread_t*) calloc(numThreads, sizeof(pthread_t));
    pthread_create(&ntid, NULL, threadFunc, &nVal);
    pthreads[0] = ntid;
    pthread_join(pthreads[0], NULL);
    return 0;
}
In library:
void callMeOnExit()
{
    printf("Exiting Thread!\n");
}
I would want for callMeOnExit to be called when the thread reaches return NULL; in this case, as well as when the main thread reaches the return 0;. Wrapping pthread_exit would work for other cases, and could be a solution, but I'd like a better one if possible.
If anyone has any ideas on how I might be able to do this, that would be great!
So after a few code reviews, we were able to find a much more elegant way to do this in Linux, which matches both what Windows does with Fibers (as Neeraj points out) and what I expected to find when I started looking into this issue.
The key is that pthread_key_create takes, as its second argument, a pointer to a destructor, which is called when any thread that has set a value for this TLS key exits. I was already using TLS elsewhere per thread, but even a simple store into TLS would get you this feature as well and ensure the destructor is called. A sketch of this approach follows.
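For illustration, here is a minimal, hedged sketch of that TLS-destructor trick (all names are made up; error handling omitted). Note that on typical implementations the key destructor runs when a thread calls pthread_exit or returns from its start routine, but not when the main thread returns from main(), since that goes through exit() instead.

#include <pthread.h>
#include <cstdio>

static pthread_key_t g_exitKey;
static pthread_once_t g_exitKeyOnce = PTHREAD_ONCE_INIT;

// Runs automatically when a thread that set the key exits.
static void onThreadExit(void* /*value*/)
{
    std::printf("Exiting Thread!\n");
}

static void makeExitKey()
{
    pthread_key_create(&g_exitKey, &onThreadExit);
}

// Call this from any library entry point; cheap after the first call per thread.
static void armThreadExitHook()
{
    pthread_once(&g_exitKeyOnce, &makeExitKey);
    if (pthread_getspecific(g_exitKey) == NULL)
        pthread_setspecific(g_exitKey, reinterpret_cast<void*>(1)); // any non-NULL value
}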
Change this:
pthread_create(&ntid, NULL, threadFunc, &nVal);
into:
struct exitInformData
{
    void* (*CB)(void*);
    void* data;
    exitInformData(void* (*cp)(void*), void* dp) : CB(cp), data(dp) {}
};
pthread_create(&ntid, NULL, exitInform, new exitInformData(&threadFunc, &nVal));
Then Add:
void* exitInform(void* data)
{
    exitInformData* ei = reinterpret_cast<exitInformData*>(data);
    void* r = (ei->CB)(ei->data); // Calls the function you want.
    callMeOnExit();               // Calls the exit notification.
    delete ei;
    return r;
}
For Windows, you could try FLS callbacks. The FLS system can be used to allocate per-thread storage (ignore the 'fiber' part; each thread contains one fiber). You get the callback so you can free the storage, but you can do other things in the callback as well.
I found out that this has already been asked, although the solution given then may not be the same as what you want...
Another idea might be to simply extend from the pthread_t class/struct, and override the pthread_exit call to call another function as you want it to, then call the superclass pthread_exit.

longjmp and RAII

So I have a library (not written by me) which unfortunately uses abort() to deal with certain errors. At the application level, these errors are recoverable so I would like to handle them instead of the user seeing a crash. So I end up writing code like this:
static jmp_buf abort_buffer;

static void abort_handler(int) {
    longjmp(abort_buffer, 1); // perhaps siglongjmp if available..
}

int function(int x, int y) {
    struct sigaction new_sa;
    struct sigaction old_sa;
    sigemptyset(&new_sa.sa_mask);
    new_sa.sa_handler = abort_handler;
    sigaction(SIGABRT, &new_sa, &old_sa);

    if (setjmp(abort_buffer)) {
        sigaction(SIGABRT, &old_sa, 0);
        return -1;
    }

    // attempt to do some work here
    int result = f(x, y); // may call abort!
    sigaction(SIGABRT, &old_sa, 0);
    return result;
}
Not very elegant code. Since this pattern ends up having to be repeated in a few spots of the code, I would like to simplify it a little and possibly wrap it in a reusable object. My first attempt involves using RAII to handle the setup/teardown of the signal handler (needs to be done because each function needs different error handling). So I came up with this:
template <int N>
struct signal_guard {
    signal_guard(void (*f)(int)) {
        sigemptyset(&new_sa.sa_mask);
        new_sa.sa_handler = f;
        sigaction(N, &new_sa, &old_sa);
    }
    ~signal_guard() {
        sigaction(N, &old_sa, 0);
    }
private:
    struct sigaction new_sa;
    struct sigaction old_sa;
};

static jmp_buf abort_buffer;

static void abort_handler(int) {
    longjmp(abort_buffer, 1);
}

int function(int x, int y) {
    signal_guard<SIGABRT> sig_guard(abort_handler);
    if (setjmp(abort_buffer)) {
        return -1;
    }
    return f(x, y);
}
Certainly the body of function is much simpler and clearer this way, but this morning a thought occurred to me. Is this guaranteed to work? Here are my thoughts:
No variables are volatile or change between calls to setjmp/longjmp.
I am longjmping to a location in the same stack frame as the setjmp and returning normally, so I am allowing the code to execute the cleanup code that the compiler emitted at the exit points of the function.
It appears to work as expected.
But I still get the feeling that this is likely undefined behavior. What do you guys think?
I assume that f is in a third party library/app, because otherwise you could just fix it to not call abort. Given that, and that RAII may or may not reliably produce the right results on all platforms/compilers, you have a few options.
Create a tiny shared object that defines abort and LD_PRELOAD it. Then you control what happens on abort, and NOT in a signal handler.
Run f within a subprocess. Then you just check the return code and if it failed try again with updated inputs. (A sketch of this approach follows after this list.)
Instead of using RAII, just call your original function from multiple call points and let it manually do the setup/teardown explicitly. It still eliminates the copy-paste in that case.
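A hedged sketch of the subprocess suggestion, assuming f() can squeeze its result into the 8-bit exit status (real code would use a pipe or shared memory for anything larger); an abort() in the child only kills the child:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern int f(int x, int y); // the third-party function from the question

int call_f_in_subprocess(int x, int y, int* result)
{
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0)
        _exit(f(x, y) & 0xff); // child: a crash here cannot take down the parent

    int status = 0;
    if (waitpid(child, &status, 0) != child)
        return -1;
    if (WIFEXITED(status))
    {
        *result = WEXITSTATUS(status);
        return 0;
    }
    return -1; // child was killed by a signal, e.g. SIGABRT from abort()
}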
I actually like your solution, and have coded something similar in test harnesses to check that a target function assert()s as expected.
I can't see any reason for this code to invoke undefined behaviour. The C Standard seems to bless it: handlers resulting from an abort() are exempted from the restriction on calling library functions from a handler. (Caveat: this is 7.14.1.1(5) of C99 - sadly, I don't have a copy of C90, the version referenced by the C++ Standard).
C++03 adds a further restriction: If any automatic objects would be destroyed by a thrown exception transferring control to another (destination) point in the program, then a call to longjmp(jbuf, val) at the throw point that transfers control to the same (destination) point has undefined behavior. I'm supposing that your statement that 'No variables are volatile or change between calls to setjmp/longjmp' includes instantiating any automatic C++ objects. (I guess this is some legacy C library?).
Nor is POSIX async signal safety (or lack thereof) an issue - abort() generates its SIGABRT synchronously with program execution.
The biggest concern would be corrupting the global state of the 3rd party code: it's unlikely that the author will take pains to get the state consistent before an abort(). But, if you're correct that no variables change, then this isn't a problem.
If someone with a better understanding of the standardese can prove me wrong, I'd appreciate the enlightenment.

What is the most efficient way to make this code thread safe?

Some C++ library I'm working on features a simple tracing mechanism which can be activated to generate log files showing which functions were called and what arguments were passed. It basically boils down to a TRACE macro being spilled all over the source of the library, and the macro expands to something like this:
typedef void(*TraceProc)( const char *msg );

/* Sets 'callback' to point to the trace procedure which actually prints the given
 * message to some output channel, or to a null trace procedure which is a no-op, in
 * case the given source file/line position was disabled by the client.
 *
 * This function also registers the callback pointer in an internal data structure
 * and resets it to zero in case the filtering configuration changed since the last
 * invocation of updateTraceCallback.
 */
void updateTraceCallback( TraceProc *callback, const char *file, unsigned int lineno );

#define TRACE(msg) \
{ \
    static TraceProc traceCallback = 0; \
    if ( !traceCallback ) \
        updateTraceCallback( &traceCallback, __FILE__, __LINE__ ); \
    traceCallback( msg ); \
}
The idea is that people can just say TRACE("foo hit") in their code and that will either call a debug printing function or be a no-op. They can use some other API (which is not shown here) to configure that only TRACE uses at certain locations (source file/line number) should actually be printed. This configuration can change at runtime.
The issue is that this idea should now be used in a multi-threaded code base. Hence, the code which TRACE expands to needs to work correctly in the face of multiple threads of execution running it simultaneously. There are about 20,000 different trace points in the code base right now and they are hit very often, so they should be rather efficient.
What is the most efficient way to make this approach thread safe? I need a solution for Windows (XP and newer) and Linux. I'm afraid of doing excessive locking just to check whether the filter configuration changed (99% of the time a trace point is hit, the configuration didn't change). I'm open to larger changes to the macro, too. So instead of discussing mutex vs. critical section performance, it would also be acceptable if the macro just sent an event to an event loop in a different thread (assuming that accessing the event loop is thread safe) and all the processing happens in the same thread, so it's synchronized using the event loop.
UPDATE: I can probably simplify this question to:
If I have one thread reading a pointer, and another thread which might write to the variable (but 99% of the time it doesn't), how can I avoid that the reading thread needs to lock all the time?
You could implement a configuration file version variable. When your program starts it is set to 0. The macro can hold a static int that is the last config version it saw. Then a simple atomic comparison between the last seen and the current config version will tell you if you need to do a full lock and re-call updateTraceCallback().
That way, 99% of the time you'll only add an extra atomic op, memory barrier or something similar, which is very cheap. 1% of the time, just do the full mutex thing; it shouldn't affect your performance in any noticeable way if it's only 1% of the time.
Edit:
Some .h file:
extern long trace_version;
Some .cpp file:
long trace_version = 0;
The macro:
#define TRACE(msg) \
{ \
    static long __lastSeenVersion = -1; \
    static TraceProc traceCallback = 0; \
    if ( !traceCallback || __lastSeenVersion != trace_version ) \
        updateTraceCallback( &traceCallback, &__lastSeenVersion, __FILE__, __LINE__ ); \
    traceCallback( msg ); \
}
The functions for incrementing a version and updates:
static long oldVersionRefcount = 0;
static long curVersionRefCount = 0;
void updateTraceCallback( TraceProc *callback, long &version, const char *file, unsinged int lineno ) {
if ( version != trace_version ) {
if ( InterlockedDecrement( oldVersionRefcount ) == 0 ) {
//....free resources.....
//...no mutex needed, since no one is using this,,,
}
//....aquire mutex and do stuff....
InterlockedIncrement( curVersionRefCount );
*version = trace_version;
//...release mutex...
}
}
void setNewTraceCallback( TraceProc *callback ) {
    //...acquire mutex...
    trace_version++; // No locks, mutexes or anything, this is atomic by itself.
    while ( oldVersionRefcount != 0 ) { /* ..sleep? */ }
    InterlockedExchange( &oldVersionRefcount, curVersionRefCount );
    curVersionRefCount = 0;
    //.... and so on...
    //...release mutex...
}
Of course, this is very simplified, since if you need to upgrade the version and the oldVersionRefCount > 0, then you're in trouble; how to solve this is up to you, since it really depends on your problem. My guess is that in those situations, you could simply wait until the ref count is zero, since the amount of time that the ref count is incremented should be the time it takes to run the macro.
I still don't fully understand the question, so please correct me on anything I didn't get.
(I'm leaving out the backslashes.)
#define TRACE(msg)
{
    static TraceProc traceCallback = NULL;
    TraceProc localTraceCallback;

    localTraceCallback = traceCallback;

    if (!localTraceCallback)
    {
        updateTraceCallback(&localTraceCallback, __FILE__, __LINE__);

        // If two threads are running this at the same time
        // one of them will update traceCallback and get it overwritten
        // by the other. This isn't a big deal.
        traceCallback = localTraceCallback;
    }

    // Now there's no way localTraceCallback can be null.
    // An issue here is if in the middle of this executing
    // traceCallback gets null'ed. But you haven't specified any
    // restrictions about this either, so I'm assuming it isn't a problem.
    localTraceCallback(msg);
}
Your comment says "resets it to zero in case the filtering configuration changes at runtime" but am I correct in reading that as "resets it to zero when the filtering configuration changes"?
Without knowing exactly how updateTraceCallback implements its data structure, or what other data it's referring to in order to decide when to reset the callbacks (or indeed to set them in the first place), it's impossible to judge what would be safe. A similar problem applies to knowing what traceCallback does - if it accesses a shared output destination, for example.
Given these limitations the only safe recommendation that doesn't require reworking other code is to stick a mutex around the whole lot (or preferably a critical section on Windows).
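Concretely, the mutex-around-everything fallback could look like the sketch below, written with C++11 primitives for brevity (the era-appropriate version would use a pthread_mutex_t or a Windows CRITICAL_SECTION; TraceProc and updateTraceCallback are repeated from the question). Note that this serializes every trace call, which is exactly the cost the question is trying to avoid, hence the refinements in the other answers.

#include <mutex>

typedef void (*TraceProc)(const char* msg);
void updateTraceCallback(TraceProc* callback, const char* file, unsigned int lineno);

static std::mutex g_traceMutex;

#define TRACE(msg) \
    { \
        std::lock_guard<std::mutex> guard(g_traceMutex); \
        static TraceProc traceCallback = 0; \
        if (!traceCallback) \
            updateTraceCallback(&traceCallback, __FILE__, __LINE__); \
        traceCallback(msg); \
    }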
I'm afraid of doing excessive locking just to check whether the filter configuration changed (99% of the time a trace point is hit, the configuration didn't change). I'm open to larger changes to the macro, too. So instead of discussing mutex vs. critical section performance, it would also be acceptable if the macro just sent an event to an event loop in a different thread (assuming that accessing the event loop is thread safe)
How do you think thread safe messaging between threads is implemented without locks?
Anyway, here's a design that might work:
The data structure that holds the filter must be changed so that it is allocated dynamically from the heap because we are going to be creating multiple instances of filters. Also, it's going to need a reference count added to it. You need a typedef something like:
typedef struct Filter
{
    unsigned int refCount;
    // all the other filter data
} Filter;
There's a singleton 'current filter' declared somewhere.
static Filter* currentFilter;
and initialised with some default settings.
In your TRACE macro:
#define TRACE(msg)
{
    static Filter* filter = NULL;
    static TraceProc traceCallback = NULL;
    if (filterOutOfDate(filter))
    {
        getNewCallback(__FILE__, __LINE__, &traceCallback, &filter);
    }
    traceCallback(msg);
}
filterOutOfDate() merely compares the filter with currentFilter to see if it is the same. It should be enough to just compare addresses. It does no locking.
getNewCallback() applies the current filter to get the new trace function and updates the filter passed in with the address of the current filter. Its implementation must be protected with a mutex lock. Also, it decrements the refCount of the original filter and increments the refCount of the new filter. This is so we know when we can free the old filter.
void getNewCallback(const char* file, int line, TraceProc* newCallback, Filter** filter)
{
    // MUTEX lock
    *newCallback = /* whatever you need to do */;
    currentFilter->refCount++;
    if (*filter != NULL)
    {
        (*filter)->refCount--;
        if ((*filter)->refCount == 0)
        {
            // free filter and associated resources
        }
    }
    *filter = currentFilter;
    // MUTEX unlock
}
When you want to change the filter, you do something like
void changeFilter()
{
    Filter* newFilter = /* build a new filter */;
    newFilter->refCount = 0;
    // MUTEX lock (same mutex as above)
    currentFilter = newFilter;
    // MUTEX unlock
}
If I have one thread reading a pointer, and another thread which might write to the variable (but 99% of the time it doesn't), how can I avoid that the reading thread needs to lock all the time?
From your code, it is OK to use the mutex inside updateTraceCallback() since it is going to be called very rarely (once per location). After taking the mutex, check whether traceCallback is already initialized: if yes, then another thread just did it for you and there is nothing to be done.
If updateTraceCallback() would turn out to be a serious performance problem due to the collisions on the global mutex, then you can simply make an array of mutexes instead and use hashed value of the traceCallback pointer as an index into the mutex array. That would spread locking over many mutexes and minimize number of collisions.
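A rough sketch of that hashed-mutex-array idea (the array size and the hash are arbitrary choices); the lock is still only taken on the rare update path inside updateTraceCallback():

#include <mutex>
#include <cstddef>

static const std::size_t kLockCount = 64; // assumption: any small power of two
static std::mutex g_traceLocks[kLockCount];

// Pick a lock based on the address of the per-call-site callback slot,
// so unrelated trace points rarely contend on the same mutex.
static std::mutex& lockFor(const void* slot)
{
    std::size_t h = reinterpret_cast<std::size_t>(slot);
    h ^= h >> 9; // cheap pointer hash
    return g_traceLocks[h % kLockCount];
}

// Inside updateTraceCallback() one would then do something like:
//   std::lock_guard<std::mutex> guard(lockFor(callback));
//   ... initialize *callback under the lock ...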
#define TRACE(msg) \
{ \
    static TraceProc traceCallback = \
        updateTraceCallback( &traceCallback, __FILE__, __LINE__ ); \
    traceCallback( msg ); \
}

How can I schedule some code to run after all '_atexit()' functions are completed

I'm writing a memory tracking system and the only problem I've actually run into is that when the application exits, any static/global classes that didn't allocate in their constructor, but are deallocating in their destructor, are deallocating after my memory tracking stuff has reported the allocated data as a leak.
As far as I can tell, the only way for me to properly solve this would be to either force the placement of the memory tracker's _atexit callback at the head of the stack (so that it is called last) or have it execute after the entire _atexit stack has been unwound. Is it actually possible to implement either of these solutions, or is there another solution that I have overlooked?
Edit:
I'm working on/developing for Windows XP and compiling with VS2005.
I've finally figured out how to do this under Windows/Visual Studio. Looking through the crt startup function again (specifically where it calls the initializers for globals), I noticed that it was simply running "function pointers" that were contained between certain segments. So with just a little bit of knowledge on how the linker works, I came up with this:
#include <iostream>
#include <cstdio>   // puts
#include <cstdlib>  // atexit
using std::cout;
using std::endl;

// Typedef for the function pointer
typedef void (*_PVFV)(void);

// Our various functions/classes that are going to log the application startup/exit
struct TestClass
{
    int m_instanceID;
    TestClass(int instanceID) : m_instanceID(instanceID) { cout << " Creating TestClass: " << m_instanceID << endl; }
    ~TestClass() { cout << " Destroying TestClass: " << m_instanceID << endl; }
};

static int InitInt(const char *ptr) { cout << " Initializing Variable: " << ptr << endl; return 42; }

static void LastOnExitFunc() { puts("Called " __FUNCTION__ "();"); }
static void CInit() { puts("Called " __FUNCTION__ "();"); atexit(&LastOnExitFunc); }
static void CppInit() { puts("Called " __FUNCTION__ "();"); }

// Our variables to be initialized
extern "C" { static int testCVar1 = InitInt("testCVar1"); }
static TestClass testClassInstance1(1);
static int testCppVar1 = InitInt("testCppVar1");

// Define our segment names
#define SEGMENT_C_INIT ".CRT$XIM"
#define SEGMENT_CPP_INIT ".CRT$XCM"

// Create the segments we are going to place our function tables into.
#pragma data_seg(SEGMENT_C_INIT)
#pragma data_seg(SEGMENT_CPP_INIT)
#pragma data_seg() // Switch back to the default segment

// Create our function pointer arrays and place them in the segments created above
#define SEG_ALLOCATE(SEGMENT) __declspec(allocate(SEGMENT))
SEG_ALLOCATE(SEGMENT_C_INIT) _PVFV c_init_funcs[] = { &CInit };
SEG_ALLOCATE(SEGMENT_CPP_INIT) _PVFV cpp_init_funcs[] = { &CppInit };

// Some more variables just to show that declaration order isn't affecting anything
extern "C" { static int testCVar2 = InitInt("testCVar2"); }
static TestClass testClassInstance2(2);
static int testCppVar2 = InitInt("testCppVar2");

// Main function which prints itself just so we can see where the app actually enters
void main()
{
    cout << " Entered Main()!" << endl;
}
which outputs:
Called CInit();
Called CppInit();
Initializing Variable: testCVar1
Creating TestClass: 1
Initializing Variable: testCppVar1
Initializing Variable: testCVar2
Creating TestClass: 2
Initializing Variable: testCppVar2
Entered Main()!
Destroying TestClass: 2
Destroying TestClass: 1
Called LastOnExitFunc();
This works due to the way MS have written their runtime library. Basically, they've setup the following variables in the data segments:
(although this info is copyrighted, I believe this is fair use as it doesn't devalue the original and is only here for reference)
extern _CRTALLOC(".CRT$XIA") _PIFV __xi_a[];
extern _CRTALLOC(".CRT$XIZ") _PIFV __xi_z[]; /* C initializers */
extern _CRTALLOC(".CRT$XCA") _PVFV __xc_a[];
extern _CRTALLOC(".CRT$XCZ") _PVFV __xc_z[]; /* C++ initializers */
extern _CRTALLOC(".CRT$XPA") _PVFV __xp_a[];
extern _CRTALLOC(".CRT$XPZ") _PVFV __xp_z[]; /* C pre-terminators */
extern _CRTALLOC(".CRT$XTA") _PVFV __xt_a[];
extern _CRTALLOC(".CRT$XTZ") _PVFV __xt_z[]; /* C terminators */
On initialization, the program simply iterates from '__xN_a' to '__xN_z' (where N is {i,c,p,t}) and calls any non null pointers it finds. If we just insert our own segment in between the segments '.CRT$XnA' and '.CRT$XnZ' (where, once again n is {I,C,P,T}), it will be called along with everything else that normally gets called.
The linker simply joins up the segments in alphabetical order. This makes it extremely simple to select when our functions should be called. If you have a look in defsects.inc (found under $(VS_DIR)\VC\crt\src\) you can see that MS have placed all the "user" initialization functions (that is, the ones that initialize globals in your code) in segments ending with 'U'. This means that we just need to place our initializers in a segment earlier than 'U' and they will be called before any other initializers.
You must be really careful not to use any functionality that isn't initialized until after your selected placement of the function pointers (frankly, I'd recommend you just use .CRT$XCT; that way it's only your code that hasn't been initialized. I'm not sure what will happen if you've linked with standard 'C' code; you may have to place it in the .CRT$XIT block in that case).
One thing I did discover was that the "pre-terminators" and "terminators" aren't actually stored in the executable if you link against the DLL versions of the runtime library. Due to this, you can't really use them as a general solution. Instead, the way I made it run my specific function as the last "user" function was to simply call atexit() within the 'C initializers'; this way, no other function could have been added to the stack yet (atexit functions are called in the reverse order in which they were added, which is also how global/static destructors are all called).
Just one final (obvious) note, this is written with Microsoft's runtime library in mind. It may work similar on other platforms/compilers (hopefully you'll be able to get away with just changing the segment names to whatever they use, IF they use the same scheme) but don't count on it.
atexit is processed by the C/C++ runtime (CRT). It runs after main() has already returned. Probably the best way to do this is to replace the standard CRT with your own.
On Windows tlibc is probably a great place to start: http://www.codeproject.com/KB/library/tlibc.aspx
Look at the code sample for mainCRTStartup and just run your code after the call to _doexit(), but before ExitProcess.
Alternatively, you could just get notified when ExitProcess gets called. When ExitProcess gets called the following occurs (according to http://msdn.microsoft.com/en-us/library/ms682658%28VS.85%29.aspx):
All of the threads in the process, except the calling thread, terminate their execution without receiving a DLL_THREAD_DETACH notification.
The states of all of the threads terminated in step 1 become signaled.
The entry-point functions of all loaded dynamic-link libraries (DLLs) are called with DLL_PROCESS_DETACH.
After all attached DLLs have executed any process termination code, the ExitProcess function terminates the current process, including the calling thread.
The state of the calling thread becomes signaled.
All of the object handles opened by the process are closed.
The termination status of the process changes from STILL_ACTIVE to the exit value of the process.
The state of the process object becomes signaled, satisfying any threads that had been waiting for the process to terminate.
So, one method would be to create a DLL and have that DLL attach to the process. It will get notified when the process exits, which should be after atexit has been processed.
Obviously, this is all rather hackish, proceed carefully.
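For completeness, a hedged sketch of that DLL-detach idea (the reporting callback is hypothetical; real code should do as little as possible here, since DllMain runs under the loader lock):

#include <windows.h>

// Hypothetical hook into the memory tracker's reporting code.
static void reportLeaks() { /* dump outstanding allocations here */ }

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
{
    if (fdwReason == DLL_PROCESS_DETACH)
    {
        // lpvReserved is non-NULL when the process is terminating
        // (as opposed to the DLL being unloaded via FreeLibrary).
        reportLeaks();
    }
    return TRUE;
}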
This is dependent on the development platform. For example, Borland C++ has a #pragma which could be used for exactly this. (From Borland C++ 5.0, c. 1995)
#pragma startup function-name [priority]
#pragma exit function-name [priority]
These two pragmas allow the program to specify function(s) that should be called either upon program startup (before the main function is called), or program exit (just before the program terminates through _exit).
The specified function-name must be a previously declared function as:
void function-name(void);
The optional priority should be in the range 64 to 255, with highest priority at 0; default is 100. Functions with higher priorities are called first at startup and last at exit. Priorities from 0 to 63 are used by the C libraries, and should not be used by the user.
Perhaps your C compiler has a similar facility?
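For example, with the Borland pragma quoted above, registering a leak report to run at exit might look like this (Borland-specific, not portable; the priority value is chosen per the rules quoted above, 64 being the first value available to user code):

// Must be previously declared as void function-name(void);
void reportLeaksAtExit(void);

#pragma exit reportLeaksAtExit 64

void reportLeaksAtExit(void)
{
    // dump the memory tracker's outstanding allocations here
}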
I've read multiple times you can't guarantee the construction order of global variables (cite). I'd think it is pretty safe to infer from this that destructor execution order is also not guaranteed.
Therefore, if your memory tracking object is global, you will almost certainly be unable to guarantee that your memory tracker object will get destructed last (or constructed first). If it's not destructed last, and other allocations are outstanding, then yes, it will notice the leaks you mention.
Also, what platform is this _atexit function defined for?
Having the memory tracker's cleanup executed last is the best solution. The easiest way I've found to do that is to explicitly control all the relevant global variables' initialization order. (Some libraries hide their global state in fancy classes or otherwise, thinking they're following a pattern, but all they do is prevent this kind of flexibility.)
Example main.cpp:
#include "global_init.inc"
int main() {
// do very little work; all initialization, main-specific stuff
// then call your application's mainloop
}
Where the global-initialization file includes object definitions and #includes similar non-header files. Order the objects in this file in the order you want them constructed, and they'll be destructed in the reverse order. 18.3/8 in C++03 guarantees that destruction order mirrors construction: "Non-local objects with static storage duration are destroyed in the reverse order of the completion of their constructor." (That section is talking about exit(), but a return from main is the same, see 3.6.1/5.)
As a bonus, you're guaranteed that all globals (in that file) are initialized before entering main. (Something not guaranteed in the standard, but allowed if implementations choose.)
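To make the ordering guarantee concrete, here is a small self-contained illustration (names invented): the object defined first in the translation unit is constructed first and therefore destructed last, which is exactly where you want the tracker.

#include <cstdio>

struct Announcer
{
    const char* name;
    explicit Announcer(const char* n) : name(n) { std::printf("ctor %s\n", name); }
    ~Announcer() { std::printf("dtor %s\n", name); }
};

// Imagine these definitions living in global_init.inc, included from main.cpp.
static Announcer g_tracker("MemTracker");      // first constructed, last destroyed
static Announcer g_subsystem("SomeSubsystem"); // destroyed before the tracker

int main() { return 0; }

// Expected output:
//   ctor MemTracker
//   ctor SomeSubsystem
//   dtor SomeSubsystem
//   dtor MemTracker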
I've had this exact problem, also writing a memory tracker.
A few things:
Along with destruction, you also need to handle construction. Be prepared for malloc/new to be called BEFORE your memory tracker is constructed (assuming it is written as a class). So you need your class to know whether it has been constructed or destructed yet!
class MemTracker
{
    enum State
    {
        unconstructed = 0, // must be 0 !!!
        constructed,
        destructed
    };

    State state;

    MemTracker()
    {
        if (state == unconstructed)
        {
            // construct...
            state = constructed;
        }
    }
};
static MemTracker memTracker; // all statics are zero-initted by linker
On every allocation that calls into your tracker, construct it!
MemTracker::malloc(...)
{
    // force call to constructor, which does nothing after first time
    new (this) MemTracker();
    ...
}
Strange, but true. Anyhow, onto destruction:
~MemTracker()
{
    OutputLeaks(file);
    state = destructed;
}
So, on destruction, output your results. Yet we know that there will be more calls. What to do? Well,...
MemTracker::free(void * ptr)
{
    do_tracking(ptr);
    if (state == destructed)
    {
        // we must be getting called late,
        // so re-output.
        // Note that this might happen a lot...
        OutputLeaks(file); // again!
    }
}
And lastly:
be careful with threading
be careful not to call malloc/free/new/delete inside your tracker, or be able to detect the recursion, etc :-)
EDIT:
And I forgot: if you put your tracker in a DLL, you will probably need to LoadLibrary() (or dlopen(), etc.) yourself to bump your own reference count, so that you don't get removed from memory prematurely. Although your class can still be called after destruction, it can't be if the code has been unloaded.
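A hedged sketch of one way to do that pinning on Windows (this uses GetModuleHandleEx's pin flag rather than a literal LoadLibrary call on your own file name; both keep the tracker's code mapped until process exit):

#include <windows.h>

static void pinSelfInMemory()
{
    HMODULE self = NULL;
    // FROM_ADDRESS: resolve the module containing this function.
    // PIN: keep it loaded until the process terminates.
    GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS |
                           GET_MODULE_HANDLE_EX_FLAG_PIN,
                       reinterpret_cast<LPCWSTR>(&pinSelfInMemory),
                       &self);
}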