I have three classes relevant to this issue. I'm implementing a hardware service for an application. PAPI (Platform API) is a hardware service class that keeps track of various hardware interfaces. I have implemented an abstract HardwareInterface class, and a class that derives from it called HardwareWinUSB.
Below are examples similar to what I've done. I've left out members that don't appear to be relevant to this issue, like functions to open the USB connection:
class PAPI {
    HardwareInterface *m_pHardware;
    PAPI() {
        m_pHardware = new HardwareWinUSB();
    }
    ~PAPI() {
        delete m_pHardware;
    }
    ERROR_CODE WritePacket(void* WriteBuf)
    {
        return m_pHardware->write(WriteBuf);
    }
};
class HardwareInterface {
    virtual ERROR_CODE write(void* WriteBuf) = 0;
};
class HardwareWinUSB : public HardwareInterface
{
    ERROR_CODE write(void* Params)
    {
        // Some USB writing code.
        // This had worked just fine before attempting to refactor
        // into this more sustainable hardware management scheme
    }
};
I've been wrestling with this for several hours now. It's a strange, reproducible issue, but is sometimes intermittent. If I step through the debugger at a higher context, things execute well. If I don't dig deep enough, I'm met with an error that reads
Exception thrown at 0x00000000 in <ProjectName.exe>: 0xC0000005: Access violation executing location 0x00000000
If I dig down into the PAPI code, I see bizarre behavior.
When I set a breakpoint in the body of WritePacket, everything appears normal. Then I do a "step over" in the debugger. After the return from the function call, my reference to 'this' is set to 0x00000000.
What is going on? It looks like a null value was pushed on the return stack? Has anyone seen something like this happen before? Am I using virtual methods incorrectly?
edit
After further dissection, I found that I was reading before calling write, and the buffer that I was reading into was declared in local scope. When new reads came in, they were being pushed into the stack, corrupting it. The next function called, write, would return to a destroyed stack.
A buffer overrun can trash the return address on the stack. You seem to be reading and writing packets with void pointers and without passing around explicit sizes, so a simple overrun bug seems quite likely. The Visual Studio compiler has options that add stack integrity checks to detect these kinds of bugs (for example /GS and /RTCs), but they are not perfect. Nonetheless, make sure you have them switched on.
Also note that the Visual Studio debugger can occasionally (but rarely) show the wrong value for this, especially if you're trying to debug optimized code. If you're at the } at the end of a method, I wouldn't necessarily worry about the debugger showing a bizarre value for this.
After further dissection, I found that I was reading before calling write, and the buffer that I was reading into was declared in local scope (in the read function).
When new reads came in, they were being pushed into the stack, corrupting it. The next function I called, write, would return to a destroyed stack.
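The fix described above can be sketched as follows. This is a minimal illustration, not the original USB code: ReadPacket, PacketReader, and the byte-vector "incoming" data are hypothetical stand-ins. The two points it shows are that the destination buffer is owned by an object that outlives the read, and that the buffer's capacity is passed explicitly so a long packet cannot run past the end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the device read: copies at most `capacity`
// bytes of the incoming packet into `dest` and reports how many were copied.
std::size_t ReadPacket(const std::vector<unsigned char>& incoming,
                       unsigned char* dest, std::size_t capacity)
{
    const std::size_t n = incoming.size() < capacity ? incoming.size() : capacity;
    if (n != 0) {
        std::memcpy(dest, incoming.data(), n);
    }
    return n;  // never more than the caller's buffer can hold
}

// The buffer is a member, so it stays valid for as long as the reader
// object exists, instead of living in a stack frame that has already
// returned by the time data arrives.
struct PacketReader {
    std::vector<unsigned char> buffer;

    explicit PacketReader(std::size_t capacity) : buffer(capacity) {}

    std::size_t Read(const std::vector<unsigned char>& incoming)
    {
        return ReadPacket(incoming, buffer.data(), buffer.size());
    }
};
```

With a four-byte reader, a ten-byte packet is truncated to four bytes rather than overwriting whatever follows the buffer.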
Related
I am analyzing an application crash dump on Windows 7 SP1. I can see two threads that are running code inside the same CRITICAL_SECTION at the time of the crash, including the crashing thread. As far as I know, this is never supposed to happen.
Here is what I have been able to determine so far:
The critical section in question is a data member in one of our C++ classes. It is initialized in the constructor with InitializeCriticalSection() (first operation in the constructor code block), and destroyed in the destructor with DeleteCriticalSection() (last operation in the destructor).
Both threads are running code on the same instance of the class, therefore using the same critical section.
We use two-step initialization for this object, with an initialize method separate from the constructor that can be called at a later time. Both threads are inside that initialization method at the time of the crash. See the code sample below.
Looking at the CRITICAL_SECTION data in the debugger, it appears that it is NOT locked (LockCount equal to -2, which is consistent with an unlocked critical section since the behavior was redesigned in Windows Vista). I don't see how that could be.
I have also confirmed that all calls to EnterCriticalSection() and LeaveCriticalSection() in the class are balanced; there is no possibility that one thread would have left the critical section early.
Here is what the initialization method looks like (not the actual code, but the same relevant structure):
bool MyClass::Initialize()
{
    EnterCriticalSection(&m_Lock);
    if (!m_Initialized)
    {
        // Lengthy initialization code here...
        // Both threads are somewhere in here when the crash happens.
        m_Initialized = true;
    }
    LeaveCriticalSection(&m_Lock);
    return m_Initialized;
}
This code has not changed in the past 15 years, and was running fine until that recent crash.
A few things that may be different on the crashing systems compared to the other systems where it usually runs:
The crash occurred on a virtual machine. Systems using that code are usually deployed on a dedicated physical machine.
The code itself is in an unmanaged dll, and the crash happened in a managed application that uses it; i.e. in a mixed environment (managed code calling unmanaged code). This is fairly recent (a few years at most); for most of its life, that code was used only by unmanaged applications.
EDIT:
As requested in the comment, here is what the constructor and destructor look like:
MyClass::MyClass()
{
    InitializeCriticalSection(&m_Lock);
}
MyClass::~MyClass()
{
    Uninitialize();
    DeleteCriticalSection(&m_Lock);
}
Note also that m_Lock is a data member of MyClass (not static).
I have also looked at the registers for the stack frame of the call to the initialize method, and traced the values back to the method entry point in disassembly view. I was able to confirm that the value of the "this" pointer, which appears in RDI at the point of the crash, was moved there from RCX at the beginning of the method. This is consistent with "this" being the implicit first argument passed to the method according to the x64 __fastcall calling convention.
I also traced the point where the Initialize method is called in both threads (this is in another project that uses our dlls as dependencies, so I had to find the pdbs and pull the code from another repository).
It is done directly into a thread function, i.e. passed as the thread entry point to a CreateThread() call. It looks like this:
static DWORD MyThread(void* a_pParam)
{
    ThreadParamContainer* pContainer = static_cast<ThreadParamContainer*>(a_pParam);
    pContainer->pReturnObject = new MyClass;
    if (pContainer->pReturnObject->Initialize() == false)
    {
        delete pContainer->pReturnObject;
        pContainer->pReturnObject = NULL;
    }
    return 0;
}
So the goal of this thread is simply to create the object, initialize it, and return it to the thread's creator. As the initialization of that particular object can take some time (it involves connecting to a remote server, among other things), the code that spawns the thread can do other things while this initialization takes place.
If two threads were created with the same ThreadParamContainer, there could be a memory leak, since the second thread to call "new" would overwrite the pointer. But this still does not explain how both threads could end up in the same critical section.
This is still all in unmanaged code.
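For comparison, the same two-step initialization can be sketched with a scoped lock. This is only an illustration under stated assumptions: std::mutex and std::lock_guard stand in for the CRITICAL_SECTION and the Enter/Leave calls, and LazyInit, InitRuns, and the counter are hypothetical names, not part of the original class. Unlike the method shown earlier, the return value is read while the lock is still held.

```cpp
#include <cassert>
#include <mutex>

class LazyInit {
public:
    // Returns true once initialization has succeeded.
    bool Initialize()
    {
        std::lock_guard<std::mutex> guard(m_lock);  // unlocks on every exit path
        if (!m_initialized) {
            ++m_initRuns;            // placeholder for the lengthy initialization
            m_initialized = true;
        }
        return m_initialized;        // read while the lock is still held
    }

    // For illustration only: not synchronized, call when no other
    // thread is inside Initialize().
    int InitRuns() const { return m_initRuns; }

private:
    std::mutex m_lock;
    bool m_initialized = false;
    int m_initRuns = 0;
};
```

Calling Initialize() repeatedly on one instance runs the initialization body once. Two threads can only be inside the body at the same time if they are operating on different lock objects or if the lock has been destroyed or overwritten.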
I am having problems with stack corruption in a new module I am working on which is part of a large legacy project. My code is written in C++ using Borland C++Builder 5.0.
I have tracked the problem to the following function:
// Note: Class TMarshalServerClientThread has the following objects defined
// CRITICAL_SECTION FCriticalSection;
// std::vector<TMarshalTagInfo*> FTagChangeQueue;
void __fastcall TMarshalServerClientThread::SendChangeNotifications()
{
    EnterCriticalSection(&FCriticalSection);
    try {
        if (FTagChangeQueue.size() == 0) {
            return;
        }
        // Process items in change queue
        FTagChangeQueue.clear();
    } __finally {
        LeaveCriticalSection(&FCriticalSection);
    }
}
This function is called in the context of a worker thread (which descends from TThread). A different thread populates the change queue with data as it becomes available. The change queue is protected by a critical section object.
When the code is run, I sporadically get access violations when attempting to leave the critical section. From what I can tell, sometimes when the __finally section is entered, the stack is corrupted. The class instance on the heap is fine, but the pointers to the class (such as the "this" pointer) appear to be invalid.
If I remove the call to return if the change queue is empty, the problem goes away. Additionally, the code to process the items in the queue is not the source of the problem, as I can comment it out and the problem remains.
So my question is: are there known issues when using __finally in C++Builder 5? Is it wrong to call return from within a try __finally block? If so, why?
Please note that I realize that there are different/better ways to do what I am doing, and I am refactoring as such. However, I fail to see why this code should be causing stack corruption.
As @duDE pointed out, you should use a matched pair of __try and __finally instead of mixing the C++ try with the Borland extension __finally.
I know it is a long time after the original question was posted, but as a warning to others, I can vouch for the symptom that Jonathan Wiens is reporting. I experienced it with Builder XE4. It does not happen frequently, but it seems that Borland/Embarcadero's implementation of try / finally blocks in a multi-threaded process very occasionally corrupts the stack. I was also using critical sections, although that may be coincidental.
I was able to resolve my problem by discarding the try / finally. I was fortunate that I was only deleting class instances in the finally block, so I was able to replace the try / finally with scope braces, using std::auto_ptr fields to delete the objects in question.
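The scope-based cleanup described above can be sketched in standard C++. This is an illustration, not the original C++Builder code: std::mutex stands in for the CRITICAL_SECTION, std::unique_ptr is used in place of std::auto_ptr (which was removed in C++17), and ChangeQueue, Push, and the int payload are hypothetical names. The early return needs no finally block because the guard's destructor releases the lock.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

class ChangeQueue {
public:
    void Push(int tag)
    {
        std::lock_guard<std::mutex> guard(m_lock);
        m_queue.push_back(std::unique_ptr<int>(new int(tag)));
    }

    // Returns the number of items that were processed.
    std::size_t SendChangeNotifications()
    {
        std::lock_guard<std::mutex> guard(m_lock);  // released on any return
        if (m_queue.empty()) {
            return 0;                // early return: the guard still unlocks
        }
        const std::size_t sent = m_queue.size();
        // ... process the items in the change queue here ...
        m_queue.clear();             // each unique_ptr frees its item
        return sent;
    }

private:
    std::mutex m_lock;
    std::vector<std::unique_ptr<int>> m_queue;
};
```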
I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, which happens only once in a while. About 95% of the time the code runs perfectly fine, but sometimes this seems to be corrupted somehow and it crashes.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
    long nRefCount = InterlockedDecrement(&m_nRefCount);
    if (nRefCount == 0)
    {
        delete this;
    }
    return nRefCount;
}
The object has been created with a new, it is not in any kind of array.
The most obvious answer is: don't delete this.
If you insist on doing that, then use the common ways of finding bugs:
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)
It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and in case of overridden operator new, or multiple copies of the C++ runtime, the particular new that matches delete found in the current scope)
Crashes upon deallocation can be a pain: It is not supposed to happen, and when it happens, the code is too complicated to easily find a solution.
Note: The use of InterlockedDecrement makes me assume you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
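A minimal sketch of that kind of lifetime logging, assuming hypothetical names (LifetimeLog, Tracked) and single-threaded use; a multi-threaded program would need to protect the set with a lock. It records each constructed address and reports a destruction of an address that is not currently live, which is what a double delete or a stray pointer would look like in the log.

```cpp
#include <cassert>
#include <iostream>
#include <set>

// Hypothetical tracker: remembers the address of every live instance.
class LifetimeLog {
public:
    static std::set<const void*>& Live()
    {
        static std::set<const void*> live;
        return live;
    }

    static void OnConstruct(const void* p)
    {
        std::cerr << "ctor " << p << '\n';
        Live().insert(p);
    }

    // Returns false if p was not live (double delete or never constructed).
    static bool OnDestroy(const void* p)
    {
        const bool wasLive = Live().erase(p) == 1;
        std::cerr << "dtor " << p << (wasLive ? " OK" : " NOT LIVE") << '\n';
        return wasLive;
    }
};

// Example of a class instrumented with the tracker.
struct Tracked {
    Tracked() { LifetimeLog::OnConstruct(this); }
    Tracked(const Tracked&) { LifetimeLog::OnConstruct(this); }
    ~Tracked() { LifetimeLog::OnDestroy(this); }
};
```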
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You should try downloading code from the internet that prints out the current stack for each log entry. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in debug build, or have the PDB file along the executable file, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code, but Google is your friend here. The thing to remember is that you can't put a __try/__except handler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH in the same function), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "it happens only 5% of the time" symptom could be caused by different code paths being executed, or by the fact that you have multiple threads playing together with the same data.
The InterlockedDecrement part bothers me: Is your object living in multiple threads? And is m_nRefCount correctly aligned and volatile LONG?
The correctly aligned and LONG part are important, here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on a 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on a 32-bit boundary. Are there #pragma pack() directives in your code? Does your project file change the default alignment (I assume you're working in Visual Studio)?
For the volatile part, InterlockedDecrement seems to generate a read/write memory barrier, so volatile should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).
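For reference, the same release pattern can be sketched portably. This is not the original CImageBuffer: std::atomic<long> stands in for the LONG counter and InterlockedDecrement, and RefCounted and the destroyedFlag parameter are hypothetical names added so the destruction can be observed. The points it illustrates are that the new count is kept in a local variable, that no member is touched after delete this, and that the private destructor prevents stack or array instances.

```cpp
#include <atomic>
#include <cassert>

class RefCounted {
public:
    // destroyedFlag is a hypothetical hook so callers can observe destruction.
    explicit RefCounted(bool* destroyedFlag = nullptr) : m_destroyed(destroyedFlag) {}

    long AddRef()
    {
        return m_refCount.fetch_add(1, std::memory_order_relaxed) + 1;
    }

    long Release()
    {
        // fetch_sub returns the previous value, so subtract 1 to get the
        // new count and keep it in a local variable.
        const long remaining = m_refCount.fetch_sub(1, std::memory_order_acq_rel) - 1;
        if (remaining == 0) {
            delete this;   // no member access after this line
        }
        return remaining;
    }

private:
    // Private destructor: instances can only be created with new and
    // destroyed through Release().
    ~RefCounted()
    {
        if (m_destroyed) {
            *m_destroyed = true;
        }
    }

    std::atomic<long> m_refCount{1};
    bool* m_destroyed;
};
```

If a caller invokes Release() more times than it holds references, the second decrement runs on freed memory, which matches an intermittent crash inside Release().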
When I throw in a method A, it causes a buffer overrun, but when I return, it runs fine.
I thought throw moves execution to the caller method, so the address it goes to should be the same as the return address, but I am obviously wrong.
Is there a way to see what address throw goes to in Visual Studio debugger?
Thank you
Berkus:
does this mean that stack of upper caller method is corrupted? So for example,
Method A calls
Method B calls
Method C. Method C throws an exception
Then, it is possible that Method C's return address is ok but Method B's return address is corrupted, causing buffer overrun? What I am seeing is that if there is no throw, my application runs fine, so I think Method A,B and C all have valid return addresses.
Throw will unwind the stack, until it reaches the function with catch in it. The return address does not matter, as throw can go up several levels of stack frames if necessary.
In C++, if you have no try/catch block then this:
Method A calls
Method B calls
Method C. Method C throws an exception
will terminate the application. You must have a try/catch block in your code if you want to avoid termination.
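The unwinding described above can be sketched as follows. FrameMarker and the log vector are hypothetical helpers added only to record the order of events; MethodA holds the only handler, so the exception thrown in MethodC passes through MethodB without using either function's normal return path.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Records when a frame is torn down during stack unwinding.
struct FrameMarker {
    std::vector<std::string>& log;
    std::string name;

    FrameMarker(std::vector<std::string>& l, const std::string& n) : log(l), name(n) {}
    ~FrameMarker() { log.push_back("leave " + name); }
};

void MethodC(std::vector<std::string>& log)
{
    FrameMarker marker(log, "C");
    throw std::runtime_error("thrown in C");
}

void MethodB(std::vector<std::string>& log)
{
    FrameMarker marker(log, "B");
    MethodC(log);   // no handler here: the exception propagates through B
}

// MethodA contains the only try/catch, two frames above the throw.
std::vector<std::string> MethodA()
{
    std::vector<std::string> log;
    try {
        MethodB(log);
    } catch (const std::runtime_error& e) {
        log.push_back(std::string("caught: ") + e.what());
    }
    return log;
}
```

The returned log shows the locals of C and B being destroyed, in that order, before the handler in A runs.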
Exactly how return addresses and thrown exceptions interact is left to the details of how your particular compiler implements exception handling. To say anything about it with much certainty, someone here would have to be familiar with those internal details.
It's certainly conceivable that a buffer overrun could corrupt data that is only used by the exception handling (thus causing a throw to fail) while leaving the return address intact (thus allowing a normal return to succeed). But again, that depends on how your compiler is making use of the stack. (On a different compiler you might get totally different symptoms). It's also possible that the corruption has caused other problems that you simply haven't noticed yet. Or that such corruption will cause problems in the future after the next time you change your code. If the stack (or other memory that C++ depends on) becomes corrupted, then just about anything could happen.
With some educated guesswork or knowledge of the compiler details, I'm sure someone could eventually answer the specific questions about return addresses and how throwing works. However, I really think these are the wrong questions to be asking.
If you actually know you have a buffer overrun, then there's no point in trying to answer these questions. Just fix the overrun.
If you only suspect you have an overrun or are trying to track down how it's happening, then try stepping through your code in the debugger and watch for changes to memory outside the bounds of your variables. Or perhaps, alter your function to always throw and then start commenting out suspicious parts of your process one by one. Once the throw starts working again, you can then take a closer look at the code you last commented since the problem is most likely there. If these suggestions don't help, then I think the real question to ask here is "How do I track down memory corruption that only affects throwing an exception?".
I wanted to know why the first of the following two code snippets (using cout) fails with an Access Violation, while the second (using printf) fails with a Stack Overflow.
First code, which gives me an Access Violation:
#include <iostream>
using namespace std;
void Test();
void Test()
{
    static int i = 0;
    cout << i++ << endl;
    Test();
}
int main()
{
    Test();
    return 0;
}
Second code, which gives me a Stack Overflow:
#include <cstdio>
void Test();
void Test()
{
    static int i = 0;
    printf("%d\n", i++);
    Test();
}
int main()
{
    Test();
    return 0;
}
I assume you understand that both functions crash due to exhaustion of the stack after an attempt at infinite recursion. I think what you are asking is: why would the cout example not crash with "Stack Overflow" also?
I do not think the answer has to do with the compiler's detection of tail recursion. If the compiler optimized the recursion away, neither example should crash.
I have a guess as to what is going on. "Stack Overflow" exception is implemented in some cases (e.g., Windows) with a single Virtual Memory "guard page" allocated at the end of the stack. When a stack access hits this guard page a special exception type is generated.
Since the Intel small-granularity page is 4096 bytes long, the guard page stands guard over a range of memory that size. If a function call allocates more than 4096 bytes of local variables, it is possible that the first stack access from it will actually stretch beyond the guard page. The next page can be expected to be unreserved memory, so an access violation would make sense in that case.
Of course you don't explicitly declare any local variables in your example. I would assume that one of the operator<<() methods allocates more than a page of local variables. In other words, that the Access Violation occurs near the beginning of an operator<<() method or some other part of the cout implementation (temporary object constructors, etc.)
Also, even in the function you wrote, the operator<<() implementations are going to need to create some storage for intermediate results. That storage is probably allocated as local storage by the compiler. I doubt it would add up to 4k in your example, though.
The only way to really understand would be to see a stack trace of the access violation to see what instruction is triggering it.
Got a stack trace of the access violation and a disassembly around the area of the faulting opcode?
If you are using the Microsoft C compiler, another possibility is that printf() and your own function were compiled with /Ge and operator<<() was not, or that only your function was compiled with /Ge and factors similar to those described above coincidentally cause the behavior you see -- because in the printf() example the crash happens while your function is being called and in the operator<<() case while you are calling the library.
Both recursive functions will never stop.
It seems that in the second case, no tail optimization is done by the compiler, thus the stack overflow.
Both functions trigger a stack overflow on my machine. I am compiling it with MS Visual Studio 2005. Maybe you should specify your platform & compiler, that will help to investigate...
Maybe you compiled something in debug mode and your cout implementation includes some checks that cannot be performed because of the stack corruption? Maybe your compiler generated code that tries to recover from the stack overflow and pops an invalid return address? Maybe you are running it on a mobile device? Hard to say without knowing the platform and the compiler.
The infinite recursive call accounts for the stack overflow. As for the access violation... it really depends on the implementation of the STL streams. You need to take a look at the source code of the streams to find out...
Though most people misunderstood your question, the answer is in there.
The second example ends with stack overflow because every function call is pushing a frame onto the stack. Eventually, it gets too big. I agree with Cătălin Pitiș, that it'd be hard to know why the streams example is ending with an access violation without looking at the source.
This reminds me of the problem of the stack being corrupted and the debugger not catching the failing program as it crashes.