Possible stack corruption inside try __finally block - c++

I am having problems with stack corruption in a new module I am working on which is part of a large legacy project. My code is written in C++ using Borland C++Builder 5.0.
I have tracked the problem to the following function:
// Note: Class TMarshalServerClientThread has the following objects defined
// CRITICAL_SECTION FCriticalSection;
// std::vector<TMarshalTagInfo*> FTagChangeQueue;
void __fastcall TMarshalServerClientThread::SendChangeNotifications()
{
    // The Win32 API takes a pointer to the CRITICAL_SECTION
    EnterCriticalSection(&FCriticalSection);
    try {
        if (FTagChangeQueue.size() == 0) {
            return;
        }
        // Process items in change queue
        FTagChangeQueue.clear();
    } __finally {
        LeaveCriticalSection(&FCriticalSection);
    }
}
This function is called in the context of a worker thread (which descends from TThread). A different thread populates the change queue with data as it becomes available. The change queue is protected by a critical section object.
When the code is run, I sporadically get access violations when attempting to leave the critical section. From what I can tell, sometimes when the __finally section is entered, the stack is corrupted. The class instance on the heap is fine, but the pointers to the class (such as the "this" pointer) appear to be invalid.
If I remove the call to return if the change queue is empty, the problem goes away. Additionally, the code to process the items in the queue is not the source of the problem, as I can comment it out and the problem remains.
So my question is are there known issues when using __finally in C++Builder 5? Is it wrong to call return from within a try __finally block? If so, why?
Please note that I realize that there are different/better ways to do what I am doing, and I am refactoring as such. However, I fail to see why this code should be causing stack corruption.

As @duDE pointed out, you should use a matched pair of __try and __finally instead of mixing the C++ try with the Borland extension __finally.

I know it is a long time after the original question was posted, but as a warning to others, I can vouch for the symptom that Jonathan Wiens is reporting. I experienced it with Builder XE4. It does not happen frequently, but it seems that Borland/Embarcadero's implementation of try / finally blocks in a multi-threaded process very occasionally corrupts the stack. I was also using critical sections, although that may be coincidental.
I was able to resolve my problem by discarding the try / finally. I was fortunate that I was only deleting class instances in the finally block, so I was able to replace the try / finally with scope braces, using std::auto_ptr fields to delete the objects in question.
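The same idea of replacing try / finally with a scoped object applies to the lock in the original question. The following is only a minimal sketch, assuming a C++11 compiler (so not C++Builder 5 itself): std::mutex and std::lock_guard stand in for the CRITICAL_SECTION, and ChangeQueue with its int queue is a hypothetical stand-in for TMarshalServerClientThread. With an older compiler, a small hand-written guard class that calls EnterCriticalSection in its constructor and LeaveCriticalSection in its destructor would play the same role.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the thread class in the question; the queue holds
// plain ints instead of TMarshalTagInfo* to keep the sketch self-contained.
class ChangeQueue {
public:
    void push(int tag) {
        std::lock_guard<std::mutex> guard(lock_);
        queue_.push_back(tag);
    }

    // Returns how many queued items were processed.
    std::size_t sendChangeNotifications() {
        std::lock_guard<std::mutex> guard(lock_);  // released on every exit path
        if (queue_.empty()) {
            return 0;  // early return is safe: the guard's destructor unlocks
        }
        const std::size_t processed = queue_.size();
        // ... process items in the change queue here ...
        queue_.clear();
        return processed;
    }

private:
    std::mutex lock_;
    std::vector<int> queue_;
};
```

Because the unlock happens in a destructor, there is no __finally block left for an early return to interact with.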

Related

How can two threads execute code inside the same critical section, when this should never happen?

I am analyzing an application crash dump on Windows 7 SP1. I can see two threads that are running code inside the same CRITICAL_SECTION at the time of the crash, including the crashing thread. As far as I know, this is never supposed to happen.
Here is what I have been able to determine so far:
The critical section in question is a data member in one of our C++ classes. It is initialized in the constructor with InitializeCriticalSection() (first operation in the constructor code block), and destroyed in the destructor with DeleteCriticalSection() (last operation in the destructor).
Both threads are running code on the same instance of the class, therefore using the same critical section.
We use two-step initialization for this object, with an initialize method separate from the constructor that can be called at a later time. Both threads are inside that initialization method at the time of the crash. See the code sample below.
Looking at the CRITICAL_SECTION data in the debugger, it appears that it is NOT locked (LockCount equal to -2, which is consistent with an unlocked critical section since the behavior was redesigned in Windows Vista). I don't see how that could be.
I have also confirmed that all calls to EnterCriticalSection() and LeaveCriticalSection() in the class are balanced; there is no possibility that one thread would have left the critical section early.
Here is what the initialization method looks like (not the actual code, but the same relevant structure):
bool MyClass::Initialize()
{
    EnterCriticalSection(&m_Lock);
    if (!m_Initialized)
    {
        // Lengthy initialization code here...
        // Both threads are somewhere in here when the crash happens.
        m_Initialized = true;
    }
    LeaveCriticalSection(&m_Lock);
    return m_Initialized;
}
This code has not changed in the past 15 years, and was running fine until that recent crash.
A few things that may be different on the crashing systems compared to the other systems where it usually runs:
The crash occurred on a virtual machine. Systems using that code are usually deployed on a dedicated physical machine.
The code itself is in an unmanaged dll, and the crash happened in a managed application that uses it, i.e. in a mixed environment (managed code calling unmanaged code). This is fairly recent (a few years at most); for most of its life, that code was used only by unmanaged applications.
EDIT:
As requested in the comment, here is what the constructor and destructor look like:
MyClass::MyClass()
{
    InitializeCriticalSection(&m_Lock);
}
MyClass::~MyClass()
{
    Uninitialize();
    DeleteCriticalSection(&m_Lock);
}
Note also that m_Lock is a data member of MyClass (not static).
I have also looked at the registers for the stack frame of the call to the initialize method, and traced the values back to the method entry point in disassembly view. I was able to confirm that the value of the "this" pointer, which appears in RDI at the point of the crash, was moved there from RCX at the beginning of the method. This is consistent with "this" being the implicit first argument passed to the method according to the x64 __fastcall calling convention.
I also traced the point where the Initialize method is called in both threads (this is in another project that uses our dlls as dependencies, so I had to find the pdbs and pulled the code from another repository).
It is done directly into a thread function, i.e. passed as the thread entry point to a CreateThread() call. It looks like this:
static DWORD MyThread(void* a_pParam)
{
    // Cast to the same type as the variable being initialized
    ThreadParamContainer* pContainer = static_cast<ThreadParamContainer*>(a_pParam);
    pContainer->pReturnObject = new MyClass;
    if (pContainer->pReturnObject->Initialize() == false)
    {
        delete pContainer->pReturnObject;
        pContainer->pReturnObject = NULL;
    }
    return 0;
}
So the goal of this thread is simply to create the object, initialize it, and return it to the thread's creator. As the initialization of that particular object can take some time (it involves connecting to a remote server, among other things), the code that spawns the thread can do other things while this initialization takes place.
If two threads were created with the same ThreadParamContainer, there could be a memory leak, since the second thread to call "new" would overwrite the pointer. But this still does not explain how both threads could end up in the same critical section.
This is still all in unmanaged code.
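For reference, the locking structure of the Initialize method can be modelled as below. This is only a sketch under stated assumptions: LazyInit, initRuns_ and the use of std::mutex in place of CRITICAL_SECTION are all hypothetical, and the counter replaces the lengthy initialization so the effect can be observed. It does not explain the crash; it only shows the invariant that a working lock should guarantee (the guarded body runs once per instance). One small difference from the code above is that the return value is read while the lock is still held.

```cpp
#include <mutex>

// Hypothetical model of MyClass from the question; the "lengthy
// initialization" is replaced by a counter so the effect is observable.
class LazyInit {
public:
    bool initialize() {
        std::lock_guard<std::mutex> guard(lock_);
        if (!initialized_) {
            ++initRuns_;       // stands in for the lengthy initialization
            initialized_ = true;
        }
        return initialized_;   // read while the lock is still held
    }

    int initRuns() const {
        std::lock_guard<std::mutex> guard(lock_);
        return initRuns_;
    }

private:
    mutable std::mutex lock_;
    bool initialized_ = false;
    int initRuns_ = 0;
};
```

If a crash dump shows two threads inside the guarded body for what looks like one instance, it may be worth checking whether the two stack frames really hold the same object address and whether that object was still alive at the time.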

Returning from Function changes context to NULL

I have three classes relevant to this issue. I'm implementing a hardware service for an application. PAPI (Platform API) is a hardware service class that keeps track of various hardware interfaces. I have implemented an abstract HardwareInterface class, and a class that derives it called HardwareWinUSB.
Below are examples similar to what I've done. I've left out members that don't appear to be relevant to this issue, like functions to open the USB connection:
class PAPI {
    HardwareInterface *m_pHardware;
    PAPI() {
        m_pHardware = new HardwareWinUSB();
    }
    ~PAPI() {
        delete m_pHardware;
    }
    ERROR_CODE WritePacket(void* WriteBuf)
    {
        return m_pHardware->write( WriteBuf);
    }
};
class HardwareInterface {
    virtual ERROR_CODE write( void* WriteBuf) = 0;
};
class HardwareWinUSB : public HardwareInterface
{
    ERROR_CODE write( void* Params)
    {
        // Some USB writing code.
        // This had worked just fine before attempting to refactor
        // Into this more sustainable hardware management scheme
    }
};
I've been wrestling with this for several hours now. It's a strange, reproducible issue, but is sometimes intermittent. If I step through the debugger at a higher context, things execute well. If I don't dig deep enough, I'm met with an error that reads
Exception thrown at 0x00000000 in <ProjectName.exe>: 0xC0000005: Access violation executing location 0x00000000
If I dig down into the PAPI code, I see bizarre behavior.
When I set a breakpoint in the body of WritePacket, everything appears normal. Then I do a "step over" in the debugger. After the return from the function call, my reference to 'this' is set to 0x00000000.
What is going on? It looks like a null value was pushed on the return stack? Has anyone seen something like this happen before? Am I using virtual methods incorrectly?
A buffer overrun can trash the return address on the stack. You seem to be reading and writing packets with void pointers and without passing around explicit sizes, so a simple overrun bug seems quite likely. The Visual Studio compiler has options to add stack integrity checks to detect these kinds of bugs, but they're not 100% perfect. Nonetheless, make sure you have them switched on.
Also note that the Visual Studio debugger can occasionally (but rarely) show the wrong value for this, especially if you're trying to debug optimized code. If you're at the } at the end of a method, I wouldn't necessarily worry about the debugger showing a bizarre value for this.
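To illustrate the point about explicit sizes, here is a minimal sketch of a bounds-checked copy. The helper name copyPacket and its parameters are hypothetical and not part of the code in the question; the idea is simply that every buffer travels with its size, so an oversized packet is rejected instead of overwriting the stack.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies a packet only when it fits in the destination.
// Returns true on success and false when the copy would overrun dst.
bool copyPacket(void* dst, std::size_t dstSize,
                const void* src, std::size_t srcSize) {
    if (dst == nullptr || src == nullptr || srcSize > dstSize) {
        return false;  // refuse rather than write past the end of dst
    }
    std::memcpy(dst, src, srcSize);
    return true;
}
```

A caller would pass sizeof of the real buffer (or the container's size) rather than a bare void pointer.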
After further dissection, I found that I was reading before calling write, and the buffer that I was reading into was declared in local scope (in the read function).
When new reads came in, their data was written into that buffer's old location on the stack, corrupting it. The next function I called, write, would return to a destroyed stack.

Howto debug double deletes in C++?

I'm maintaining a legacy application written in C++. It crashes every now and then and Valgrind tells me it's a double delete of some object.
What are the best ways to find the bug that is causing a double delete in an application you don't fully understand and which is too large to be rewritten?
Please share your best tips and tricks!
Here are some general suggestions that have helped me in that situation:
Turn your logging level up to full debug, if you are using a logger. Look for suspicious stuff in the output. If your app doesn't log pointer allocations and deletes of the object/class under suspicion, it's time to insert some cout << "class Foo constructed, ptr= " << this << endl; statements in your code (and corresponding delete/destructor prints).
Run valgrind with --db-attach=yes. I've found this very handy, if a bit tedious. Valgrind will show you a stack trace every time it detects a significant memory error or event and then ask you if you want to debug it. You may find yourself pressing 'n' many times if your app is large, but keep looking for the line of code where the object in question is deleted the first (and the second) time. (Note that --db-attach was removed in Valgrind 3.11; newer versions use the built-in gdbserver instead, via --vgdb=yes --vgdb-error=0.)
Just scour the code. Look for construction/deletion of the object in question. Sadly, sometimes it winds up being in a 3rd party library :-(.
Update: I found this out recently: GCC 4.8 and later (if you can use GCC on your system) has a built-in feature for detecting memory errors, AddressSanitizer, enabled with -fsanitize=address. It is also available in the LLVM compiler system.
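The logging suggestion can be taken one step further by recording which addresses are currently live. The following is a rough sketch, not a drop-in tool: LiveObjectRegistry and its method names are hypothetical, it is not thread-safe, and the suspect class would have to call onConstruct(this) from its constructors and onDestroy(this) from its destructor.

```cpp
#include <iostream>
#include <set>

// Hypothetical registry of live objects, keyed by address.
// Not thread-safe; a real version would need a lock.
class LiveObjectRegistry {
public:
    void onConstruct(const void* p) {
        std::clog << "constructed, ptr=" << p << '\n';
        live_.insert(p);
    }

    // Returns false when p is not currently live, i.e. a suspected double delete.
    bool onDestroy(const void* p) {
        if (live_.erase(p) == 0) {
            std::clog << "SUSPECTED DOUBLE DELETE, ptr=" << p << '\n';
            return false;
        }
        std::clog << "destroyed, ptr=" << p << '\n';
        return true;
    }

private:
    std::set<const void*> live_;
};
```

One caveat: the allocator may hand the same address to a new object between the two deletes, in which case the second delete looks legitimate to the registry, so a clean log does not prove the absence of the bug.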
Yep. What @OliCharlesworth said. There's no surefire way of testing a pointer to see if it points to allocated memory, since it really is just the memory location itself.
The biggest problem your question implies is the lack of reproducibility. Continuing with that in mind, you're stuck with changing simple 'delete' constructs to delete foo;foo = NULL;.
Even then the best case scenario is "it seems to occur less" until you've really stamped it down.
I'd also ask by what evidence Valgrind suggests it's a double-delete problem. Might be a better clue lingering around in there.
It's one of the simpler truly nasty problems.
This may or may not work for you.
Long time ago I was working on 1M+ lines program that was 15 years old at the time. Faced with the exact same problem - double delete with huge data set. With such data any out of the box "memory profiler" would be a no go.
Things that were on my side:
It was very reproducible - we had macro language and running same script exactly the same way reproduced it every time
Sometime during the history of the project someone decided that "#define malloc my_malloc" and "#define free my_free" had some use. These didn't do much more than call built-in malloc() and free() but project already compiled and worked this way.
Now the trick/idea:
void* my_malloc(int size)
{
    static int allocation_num = 0; // it was single threaded
    char* p = (char*)builtin_malloc(size + 16); // 16-byte header before the user block
    if (p == NULL) return NULL;
    *(int*)p = ++allocation_num;   // header: allocation number
    *(p + sizeof(int)) = 0;        // header: not freed
    return p + 16;                 // caller only sees the memory after the header
}
void my_free(void* p)
{
    char* header = (char*)p - 16;  // step back to the header written by my_malloc
    if (*(header + sizeof(int)))
    {
        // this is double free, check the allocation number at *(int*)header
        // then rerun app with this in my_malloc
        // if (allocation_num == XXX) debug_break();
    }
    *(header + sizeof(int)) = 1; // freed
    //builtin_free(header); // do not do this until problem is figured out
}
With new/delete it might be trickier, but still with LD_PRELOAD you might be able to replace malloc/free without even recompiling your app.
You are probably upgrading from a version that treated delete differently than the new version.
Probably the previous version, when delete was called, effectively did a check like if (X != NULL) { delete X; X = NULL; }, whereas the new version just performs the delete.
You might need to go through the code, check pointer assignments, and track references to each object from construction to deletion.
I've found this useful: backtrace() on Linux. (You have to compile with -rdynamic.) This lets you find out where that double free is coming from by putting a try/catch block around all memory operations (new/delete), then printing out your stack trace in the catch block.
This way you can narrow down the suspects much faster than running valgrind.
I wrapped backtrace in a handy little class so that I can just say:
try {
...
} catch (...) {
StackTrace trace;
std::cerr << "Double free!!!\n" << trace << std::endl;
throw;
}
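The StackTrace class itself is not shown in this answer. As a rough idea of what such a wrapper might do, here is a minimal sketch built on backtrace() and backtrace_symbols() from glibc's <execinfo.h>. The function name captureStackTrace and the default of 32 frames are assumptions for illustration; it only works on platforms that provide <execinfo.h>, and readable function names still depend on linking with -rdynamic.

```cpp
#include <execinfo.h>  // glibc-specific: backtrace, backtrace_symbols
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: returns the current call stack as printable strings.
// maxFrames must be positive.
std::vector<std::string> captureStackTrace(int maxFrames = 32) {
    std::vector<void*> frames(static_cast<std::size_t>(maxFrames));
    const int count = backtrace(frames.data(), maxFrames);

    std::vector<std::string> lines;
    char** symbols = backtrace_symbols(frames.data(), count);
    if (symbols == nullptr) {
        return lines;  // symbol lookup failed; return an empty trace
    }
    for (int i = 0; i < count; ++i) {
        lines.push_back(symbols[i]);
    }
    std::free(symbols);  // backtrace_symbols allocates the array with malloc
    return lines;
}
```

Each returned line can be written to the log next to the pointer value being deleted.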
On Windows, assuming the app is built with MSVC++, you can take advantage of the extensive heap debugging tools built into the debug version of the standard library.
Also on Windows, you can use Application Verifier. If I recall correctly, it has a mode that forces each allocation onto a separate page with protected guard pages in between. It's very effective at finding buffer overruns, but I suspect it would also be useful for a double-free situation.
Another thing you could do (on any platform) would be to make a copy of the sources that are transformed (perhaps with macros) so that every instance of:
delete foo;
is replaced with:
{ delete foo; foo = nullptr; }
(The braces help in many cases, though it's not perfect.) That will turn many instances of double-free into a null pointer dereference, making it much easier to detect. It doesn't catch everything; you might have a copy of a stale pointer, but it can help squash a lot of the common use-after-delete scenarios.
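Where a macro-based transformation is awkward, the same effect can be had with a small function template. This is just a sketch; the name deleteAndNull is made up for illustration. It relies on the fact that deleting a null pointer is a no-op in C++.

```cpp
// Hypothetical helper: deletes the object and clears the caller's pointer.
template <typename T>
void deleteAndNull(T*& ptr) {
    delete ptr;     // deleting a null pointer is a no-op
    ptr = nullptr;  // a second call through the same variable does nothing
}
```

As with the brace version, this only protects the one variable passed in; any other copy of the pointer remains stale.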

Locating memory errors in a C++ GUI

Or at least I think the problem involves some kind of memory error. I'm making a program in SFML and I'm currently working on the menus using a GUI class that I made just for SFML. Internally, the GUI class uses std::shared_ptr to manage all of its internal pointers. The program consistently crashes after main() exits and all global destructors have been called, and gdb says a break point was triggered in ntdll!WaitForAlpCompletion, which leads me to believe that the problem is memory corruption. Whenever I remove the GUI instantiation from the menu function, it exits and closes with no errors. This seems to indicate GUI as the cause of the crash, except that sub-menus which create and destroy their own instances of GUI can be called and exited without any crashes or break points.
Some pseudocode:
SubMenu
{
Create GUI
Do Menu
Destroy GUI
}
Menu
{
Create GUI
Do Menu -> SubMenu
Destroy GUI
}
main
{
Init Stuff
Menu
UnInit Stuff
Destroy GUI
return 0
}
//after return
Global Dtors
Breakpoint triggered???
I'm at a loss as to what this could be. I plan on using some memory debugger like valgrind sometime today, but I was wondering if anyone else had any ideas on what this could be.
Finally figured it out!!!!! It turns out that std::map calls the destructors of its objects every time it is re-sized, causing the shared_ptr's within to delete their data several times. A few "quick" design changes and fixed :) Thanks guys!
A heap corruption can be caused with this code:
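#include <new> // for std::nothrow

int main()
{
    int *A(new(std::nothrow) int(10));
    int *B(A);  // B is a second pointer to the same allocation
    delete B;   // frees the allocation
    delete A;   // double delete: undefined behaviour, typically heap corruption
}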
Does any of your code contain this similar situation?
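The shared_ptr equivalent of this mistake is constructing two shared_ptr objects from the same raw pointer, which gives each its own reference count. A small sketch of the wrong and the right pattern follows; the function name sharedOwnerCount is hypothetical, and the wrong pattern is left in a comment because actually running it is undefined behaviour.

```cpp
#include <memory>

// Wrong pattern (kept as a comment because it is undefined behaviour):
//   int* raw = new int(10);
//   std::shared_ptr<int> a(raw);
//   std::shared_ptr<int> b(raw);  // second control block: raw is deleted twice

// Correct pattern: copy the shared_ptr so both owners share one control block.
long sharedOwnerCount() {
    std::shared_ptr<int> a = std::make_shared<int>(10);
    std::shared_ptr<int> b = a;  // shares ownership with a
    return a.use_count();        // both owners are counted
}
```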

My code crashes on delete this

I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, only once in a while. About 95% of the time the code runs perfectly fine but sometimes this seems to be corrupted somehow and crash.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
    long nRefCount = InterlockedDecrement(&m_nRefCount);
    if(nRefCount == 0)
    {
        delete this;
    }
    return nRefCount;
}
The object has been created with a new, it is not in any kind of array.
The most obvious answer is : don't delete this.
If you insist on doing that, then use common ways of finding bugs:
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)
It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and, in the case of an overridden operator new or multiple copies of the C++ runtime, the particular new that matches the delete found in the current scope).
Crashes upon deallocation can be a pain: It is not supposed to happen, and when it happens, the code is too complicated to easily find a solution.
Note: The use of InterlockedDecrement makes me assume you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You could take existing stack-walking code from the internet to print out the current stack with each log entry. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in debug build, or have the PDB file along the executable file, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code; here, Google is your friend. The thing to remember is that you can't put a __try/__except handler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH within the same function), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "It happens only 5% of the time" symptom could be caused by different code paths being executed, or by multiple threads working on the same data.
The InterlockedDecrement part bothers me: Is your object living in multiple threads? And is m_nRefCount correctly aligned and volatile LONG?
The correctly aligned and LONG part are important, here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on a 32-bit boundary. Are there #pragma pack() instructions in your code? Does your project file change the default alignment (I assume you're working in Visual Studio)?
For the volatile part, InterlockedDecrement seems to generate a read/write memory barrier, so the volatile part should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).
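For comparison, here is a minimal portable sketch of the same Release() pattern using std::atomic, which takes care of the type, alignment and barrier questions above. The class name RefCountedBuffer, the initial count of 1, and the destroyedFlag test hook are assumptions for illustration and not part of the original CImageBuffer.

```cpp
#include <atomic>

// Hypothetical reference-counted buffer modelled on CImageBuffer::Release().
class RefCountedBuffer {
public:
    explicit RefCountedBuffer(bool* destroyedFlag)
        : refCount_(1), destroyedFlag_(destroyedFlag) {}

    long addRef() { return ++refCount_; }

    long release() {
        const long remaining = --refCount_;  // atomic decrement
        if (remaining == 0) {
            delete this;  // no member may be touched after this line
        }
        return remaining;  // local copy, so still valid after delete
    }

private:
    // Private destructor: instances can only live on the heap and can only
    // be destroyed through release().
    ~RefCountedBuffer() { *destroyedFlag_ = true; }

    std::atomic<long> refCount_;
    bool* destroyedFlag_;  // test hook, not part of the original class
};
```

Even with a correct counter, the object is still destroyed too early if some caller releases a reference it never added, so logging every addRef/release pair with the this pointer remains worthwhile.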