I've got a strange problem here. Assume that I have a class with some virtual methods. Under certain circumstances an instance of this class should call one of those methods. Most of the time no problems occur at that stage, but sometimes it turns out that the virtual method cannot be called because the pointer to that method is NULL (as shown in VS), so a memory access violation exception occurs. How could that happen?
The application is pretty large and complicated, so I don't really know what low-level steps lead to this situation. Posting raw code wouldn't be useful.
UPD: Ok, I see that my presentation of the problem is rather vague, so schematically the code looks like this:
void MyClass::FirstMethod() const { /* Do stuff */ }
void MyClass::SecondMethod() const
{
// This is where exception occurs,
// description of this method during runtime in VS looks like 0x000000
FirstMethod();
}
No constructors or destructors involved.
Heap corruption is a likely candidate. The v-table pointer in the object is vulnerable because it is usually the first field in the object. A buffer overflow in some other object that happens to be adjacent will wipe the v-table pointer. The call to a virtual method, often much later, will then blow up.
Another classic case is having a bad "this" pointer, usually NULL or a low value. That happens when the object reference on which you call the method is bad. The method will run as usual but blow up as soon as it tries to access a class member. Again, heap corruption or using a pointer that was deleted will cause this. Good luck debugging this; it is never easy.
Possibly you're calling the function (directly or indirectly) from a constructor of a base class which itself doesn't have that function.
Possibly there's a broken cast somewhere (such as a reinterpret_cast of a pointer when there's multiple inheritance involved) and you're looking at the vtable for the wrong class.
Possibly (but unlikely) you have somehow trashed the vtable.
Is the pointer to the function null just for this object, or for all other objects of the same type? If the former, then the vtable pointer is broken, and you're looking in the wrong place. If the latter, then the vtable itself is broken.
One scenario this could happen in is if you tried to call a pure virtual method in a destructor or constructor. At this point the virtual table pointer for the method may not be initialized causing a crash.
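To make the constructor case concrete, here is a minimal sketch using hypothetical Base and Derived classes (they are not taken from the question). While the Base constructor runs, the object is still only a Base, so a virtual call made there resolves to Base's version. If that function were pure virtual instead, the call would be undefined behaviour, which typically shows up as a runtime abort or a crash.

```cpp
#include <cassert>
#include <string>

// Hypothetical classes, used only for illustration.
struct Base {
    std::string seen_in_ctor;
    Base() { seen_in_ctor = name(); }   // virtual call while only the Base part exists
    virtual ~Base() = default;
    virtual std::string name() const { return "Base"; }
};

struct Derived : Base {
    std::string name() const override { return "Derived"; }
};

// Returns what the Base constructor saw while a Derived object was being built.
inline std::string name_seen_during_construction() {
    Derived d;
    assert(d.name() == "Derived");      // once construction is complete, dispatch is normal
    return d.seen_in_ctor;
}
```

Calling name_seen_during_construction() yields "Base", not "Derived", which is why code that relies on the derived override during construction or destruction misbehaves.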
Is it possible the "this" pointer is getting deleted during SecondMethod's processing?
Another possibility is that SecondMethod is actually being called with an invalid pointer right up front, and that it just happens to work (by undefined behavior) up to the nested function call, which then fails. If you're able to add print code, check whether "this" and/or other pointers in use have values like 0xcdcdcdcd or 0xfdfdfdfd at various points during execution of those methods. Those values are (I believe) fill patterns that the VS debug runtime writes into heap memory on alloc/dealloc, which may be why it works when compiled in debug mode.
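The check suggested above could be wrapped in a small helper. This is only a sketch: the function name is made up, and it assumes the single-byte fill patterns used by the MSVC debug runtime (0xCD for freshly allocated heap memory, 0xDD for freed heap memory, 0xFD for the guard bytes around a heap block, 0xCC for uninitialized stack memory).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when every byte of the pointer value repeats one
// of the MSVC debug-runtime fill bytes (0xCD, 0xDD, 0xFD, 0xCC).
inline bool looks_like_msvc_debug_fill(std::uintptr_t value) {
    const unsigned char first = static_cast<unsigned char>(value);
    if (first != 0xCD && first != 0xDD && first != 0xFD && first != 0xCC)
        return false;
    for (std::size_t i = 1; i < sizeof value; ++i) {
        // Compare each remaining byte of the value with the first one.
        if (static_cast<unsigned char>(value >> (8 * i)) != first)
            return false;
    }
    return true;
}

// Convenience overload for use inside a member function, e.g. on `this`.
inline bool looks_like_msvc_debug_fill(const void* p) {
    assert(p != nullptr && "a null pointer is a different problem");
    return looks_like_msvc_debug_fill(reinterpret_cast<std::uintptr_t>(p));
}
```

Inside SecondMethod one could log looks_like_msvc_debug_fill(this) before the nested call. A true result suggests the pointer was read from memory that was never initialized or has already been freed.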
What you are most likely seeing is a side-effect of the actual problem. Most likely heap or memory corruption, or referencing a previously freed object or null pointer.
If you can consistently make it crash at the same place and can figure out where the null pointer is being loaded from, then I suggest using the debugger to put a breakpoint on 'write' at that memory location. Once the breakpoint is triggered, you are most likely looking at the code that actually caused the corruption.
If the memory access violation happens only when Visual Studio fails to show the method address, then it could be caused by missing debug information. You are probably debugging code compiled with release (non-debug) compiler/linker flags.
Try enabling debug info in the project's C++ properties, then rebuild and restart the debugger. If that helps, you will see all the normal traceable things like the stack, variables, etc.
If your this pointer is NULL, corruption is unlikely. Unless you're zeroing memory you shouldn't have.
You didn't say whether you're debugging a Debug (not optimized) or Release (optimized) build. Typically, in a Release build the optimizer will remove the this pointer if it is not needed. So, if you're debugging an optimized build, seeing the this pointer as 0 doesn't mean anything. You have to rely on the disassembly to tell you what's going on. Try turning off optimization in your Release build if you cannot reproduce the problem in a Debug build. When debugging an optimized build, you're debugging assembly, not C++.
If you're already debugging a non-optimized build, make sure you do a clean rebuild before spending too much time debugging corrupted images. Debug builds are typically linked incrementally, and the incremental linker is known to produce problems like this. If you're running a clean Debug build and still can't figure out what went wrong, post the stack dump and more code. I'm sure we can help you figure it out.
Related
I have some code I wrote a few years ago. It has been working fine, but after a recent rebuild with some new, unrelated code elsewhere, it is no longer working. This is the code:
//myobject.h
...
inline CMapStringToOb* GetMap(void) {return (m_lpcMap);};
...
The above is accessed from the main app like so:
//otherclass.cpp
...
CMapStringToOb* lpcMap = static_cast<CMyObject*>(m_lpcBaseClass)->GetMap();
...
Like I said, this WAS working for a long time, but it's just decided to start failing as of our most recent build. I have debugged into this, and I am able to see that, in the code where the pointer is set, it is correctly setting the memory address to an actual value. I have even been able to step into the set function, write down the memory address, then move to this function, let it get 0xfdfdfdfd, and then manually get the memory address in the debugger. This causes the code to work. Now, from what I've read, 0xfdfdfdfd means guarding bytes or "no man's land", but I don't really understand what the implications of that are. Supposedly it also means an off by one error, but I don't understand how that could happen, if the code was working before.
I'm assuming from the Hungarian notation that you're using Visual Studio. Since you do know the address that holds the map pointer, start your program in the debugger and set a data breakpoint when that map pointer changes (the memory holding the map pointer, not the map pointed to). Then you'll find out exactly when it's getting overwritten.
0xfdfdfdfd typically implies that you have accessed memory that you weren't supposed to.
There is a good chance the memory was allocated and subsequently freed. So you're using freed memory.
static_cast can modify a pointer and you have an explicit cast to CMyObject and an implicit cast to CMapStringToOb. Check the validity of the pointer directly returned from GetMap().
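To illustrate how static_cast can change a pointer's value, here is a small sketch with a hypothetical multiple-inheritance hierarchy (First, Second and Both are invented names, not the MFC classes from the question). On common ABIs the second base subobject sits at a non-zero offset, so the cast adds that offset; the standard itself does not promise a particular layout.

```cpp
#include <cassert>

// Hypothetical hierarchy with two non-empty polymorphic bases.
struct First  { int a = 1; virtual ~First() = default; };
struct Second { int b = 2; virtual ~Second() = default; };
struct Both : First, Second { int c = 3; };

// True when converting Both* to Second* produced a different address.
inline bool cast_adjusts_address(Both& obj) {
    Second* second = static_cast<Second*>(&obj);   // compiler applies the subobject offset
    assert(static_cast<Both*>(second) == &obj);    // the reverse static_cast undoes it
    assert(second->b == 2);                        // the adjusted pointer really is a Second
    return static_cast<void*>(second) != static_cast<void*>(&obj);
}
```

A reinterpret_cast or a cast through void* skips this adjustment, which is one way to end up reading a member such as m_lpcMap from the wrong offset.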
Scenarios where "magic" happens almost always come back to memory corruption. I suspect that somewhere else in your code you've modified memory incorrectly, and it's resulting in this peculiar behavior. Try testing some different ways of entering this part of the code. Is the behavior consistent?
This could also be caused by an incorrectly built binary. Try cleaning and rebuilding your project.
I have a class B that inherits from class A, which has some virtual functions. Class B also has a virtual function (foo) that seems to have no address. When I step through with the debugger, it shows foo at address 0x00000000, and when I try to step in, it fails with an access violation at 0x00000005. If I make that function non-virtual, the debugger steps in and works fine until I reach a std::vector. There, when I call push_back, it fails with the same access violation at address 0x00000005 while writing some stuff at address 0xabababab, and the call stack points to a mutex lock in the insert function.
Note: I'm not using any other thread, and the incremental linker crashes every time I compile. Only the full linker successfully creates the exe. The compiler is from Visual Studio 2008 Pro, and this problem started to occur when I stripped out unused source files and source code.
Unfortunately I was unable to revert to the previous state in order to spot the change that caused this.
How can I detect the source of the problem without reverting the entire project? Also, has anyone encountered this kind of error? Maybe it has the same cause.
You guess that the virtual table is broken, but that's unlikely, because vtables are usually stored in read-only memory.
I can think of two reasons for this behavior:
The object you are using has been deleted. It may work by chance if the memory where the object used to be is still intact, but fail miserably if it gets overwritten.
The object you are using is not of dynamic type B. Maybe it is of type A or maybe of an unrelated type.
I have successfully tracked down this kind of issue with printf debugging: add a few lines with printf("XXX %p", this); in the constructor of B, the destructor, the virtual functions and the failing function, and you'll be able to deduce what is happening.
Yes, I know, printf debugging is not cool...
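The printf approach above can be sketched as follows. Tracked stands in for class B, and the live_objects registry is an extra assumption added here so that the logged addresses can also be checked programmatically. The sketch is not thread-safe.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <set>

// Hypothetical registry of addresses of Tracked objects that are currently alive.
inline std::set<std::uintptr_t>& live_objects() {
    static std::set<std::uintptr_t> objects;
    return objects;
}

inline bool is_alive(std::uintptr_t address) {
    return live_objects().count(address) == 1;
}

struct Tracked {
    Tracked() {
        std::fprintf(stderr, "XXX ctor %p\n", static_cast<void*>(this));
        live_objects().insert(reinterpret_cast<std::uintptr_t>(this));
    }
    Tracked(const Tracked&) : Tracked() {}   // copies are logged and registered too
    virtual ~Tracked() {
        std::fprintf(stderr, "XXX dtor %p\n", static_cast<void*>(this));
        live_objects().erase(reinterpret_cast<std::uintptr_t>(this));
    }
    virtual void foo() const {
        // A deleted or foreign object fails this check (in builds with asserts enabled).
        assert(is_alive(reinterpret_cast<std::uintptr_t>(this)));
    }
};
```

If the failing call prints or asserts on an address that never appeared in a "ctor" line, the object is not a B at all; if the address appeared in a "dtor" line first, the object was already destroyed.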
You are calling a virtual function on a null pointer. The compiler adds code that uses a hidden pointer in the object to locate the final overrider, and that operation is failing. When you change the function to non-virtual, the call is dispatched statically, but again, access to members fails because the this pointer is null.
You should check the validity of the object on which you are calling the method in your code.
I've been struggling with a heap corruption problem for a few days. I was first warned by the vs 2005 debugger that I may have corrupted the heap, after deleting an object I had previously new'ed. Doing research on this problem led me to gflags and the page heap setting. After enabling this setting for my particular image, it supposedly pointed me to the line that is actually causing the corruption.
Gflags identified the constructor for the object in question as the culprit. The object derives as follows:
class POPUPS_EXPORT MLUNumber : public MLUBase
{
...
};
class POPUPS_EXPORT MLUBase : public BusinessLogicUnit
{
...
};
I can instantiate an MLUNumber in a separate thread, and no heap corruption occurs.
I can instantiate a different class, that also inherits from MLUBase, that does not cause heap corruption.
The access violation caused by the corruption occurs at the opening brace of the constructor, which appears to be because of the implicit initialization of the object (?).
The base class constructor (MLUBase) successfully finishes.
From digging with the memory window in vs 2005, it appears that there was not enough space allocated for the actual object. My guess is that enough was allocated for the base class only.
The line causing the fault:
BusinessLogicUnit* biz = new MLUNumber();
I'm hoping for either a reason that might cause this, or another troubleshooting step to follow.
Unfortunately, with the information given, it's not possible to definitively diagnose the problem.
Some things you may want to check:
Make sure BusinessLogicUnit has a virtual destructor. When deleting objects through a base pointer, a virtual destructor must be present in the base class for the subclass to be properly destructed.
Make sure you're building all source files with the same preprocessor flags and compiler options. A difference in flags (perhaps between debug/release flags?) could result in a change in structure size, and thus an inconsistency between sizes reported in different source files.
It's possible for some types of heap corruption to go undetected, even with your gflags settings. Audit your other heap uses to try to find the source of your issues as well. Ideally you should put together a minimal test case that will reliably crash, but with a minimum amount of activity, so you can narrow down the cause.
Try a clean solution and rebuild; I've occasionally seen timestamps getting screwed up, and an old object file can get in with an out-of-date structure definition. Worth checking at least :)
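One of the checks above, that every source file agrees on a structure's size, can be enforced by the compiler. A minimal sketch follows, using a made-up Record type; the expected numbers assume a typical ABI where std::uint32_t is 4-byte aligned and default packing is in effect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record type standing in for a class shared between modules.
struct Record {
    std::uint8_t  tag;
    std::uint32_t value;
};

// Placed in the shared header, these fail to compile in any translation unit
// that sees a different layout (for example because a packing pragma leaked in).
static_assert(sizeof(Record) == 8, "Record size differs from the expected layout");
static_assert(offsetof(Record, value) == 4, "Record::value is not at the expected offset");

// Runtime variant: another module reports the size it was compiled with.
inline void expect_same_layout(std::size_t size_seen_by_other_module) {
    assert(size_seen_by_other_module == sizeof(Record));
}
```

The static_assert lines catch mismatched flags at build time. The runtime variant is meant for a case like a DLL boundary, where the two sides are compiled separately and one side can pass its own sizeof to the other.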
BusinessLogicUnit* biz = new MLUNumber();
How do you delete the memory? Using the base-class pointer? Have you made the destructor of BusinessLogicUnit virtual? It must be virtual.
class BusinessLogicUnit
{
public:
//..
virtual ~BusinessLogicUnit(); //it must be virtual!
};
Otherwise deleting the derived class object through the base-class pointer invokes undefined behavior as per the C++ Standard.
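A small sketch of the well-defined case, with Unit and Number as stand-ins for BusinessLogicUnit and MLUNumber (the real classes are not shown in the question): because the base destructor is virtual, deleting through the base pointer runs the derived destructor as well.

```cpp
#include <cassert>

// Stand-in for BusinessLogicUnit.
struct Unit {
    virtual ~Unit() = default;            // virtual, so delete through Unit* is well defined
};

// Stand-in for MLUNumber; counts how often its destructor runs.
struct Number : Unit {
    explicit Number(int& destroyed) : destroyed_(destroyed) {}
    ~Number() override { ++destroyed_; }  // runs even when deleted via Unit*
private:
    int& destroyed_;
};

// Returns how many Number destructors ran when deleting through the base pointer.
inline int delete_through_base() {
    int destroyed = 0;
    Unit* unit = new Number(destroyed);
    assert(destroyed == 0);
    delete unit;
    return destroyed;
}
```

Without the virtual keyword on ~Unit() the same delete would be undefined behaviour, and in practice the debug heap may then report corruption because the wrong size or address is freed.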
BusinessLogicUnit is not an MLUNumber. Why would you allocate this way? Instead
BusinessLogicUnit* biz = new BusinessLogicUnit();
Or maybe you do something like this?
struct A
{
SomeType & m_param;
A(SomeType & param) : m_param(param)
{
...use m_param here...
}
};
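A a((SomeType())); // passing a temporary by reference; the extra parentheses keep this from being parsed as a function declaration
Then that's undefined behaviour as soon as m_param is used after this statement, because the referenced temporary dies at the end of the full expression that constructs a. (Binding a temporary to a non-const reference like this only compiles as a Visual C++ extension.)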
I agree with bdonlan that there isn't enough information yet to figure out what's wrong. There are a lot of good suggestions here, but just guessing possible reasons why an application is crashing is not a smart way to root cause an issue.
You've done the right thing by enabling instrumentation (pageheap) to help you narrow down the problem. I would continue down this path by finding out exactly which memory address is causing the access violation (and where the address came from).
This guy:
virtual phTreeClass* GetTreeClass() const { return (phTreeClass*)m_entity_class; }
When called, crashed the program with an access violation, even after a full recompile. All member functions and virtual member functions had correct memory addresses (I hovered mouse over the methods in debug mode), but this function had a bad memory address: 0xfffffffc.
Everything looked okay: the 'this' pointer, and everything works fine up until this function call. This function is also pretty old and I didn't change it for a long time. The problem just suddenly popped up after some work, which I commented all out to see what was doing it, without any success.
So I removed the virtual, compiled, and it works fine. I add virtual, compiled, and it still works fine! I basically changed nothing, and remember that I did do a full recompile earlier, and still had the error back then.
I wasn't able to reproduce the problem. But now it is back. I didn't change anything. Removing virtual fixes the problem.
Don't ever use C-style casts with polymorphic types unless you're seriously sure of what you're doing. The overwhelming probability is that you cast it to a type that it wasn't. If your pointers don't implicitly cast (because they cast to a base class, which is safe) then you're doing it wrong.
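As a sketch of the safer alternative, the hypothetical hierarchy below (EntityClass, TreeClass and RockClass are invented; phTreeClass and m_entity_class are not modelled) uses dynamic_cast, which yields nullptr when the object is not of the requested type instead of silently producing a bogus pointer.

```cpp
#include <cassert>

// Hypothetical polymorphic hierarchy.
struct EntityClass { virtual ~EntityClass() = default; };
struct TreeClass : EntityClass { int branches = 3; };
struct RockClass : EntityClass { };

// Checked downcast: nullptr when entity is not actually a TreeClass.
inline TreeClass* as_tree(EntityClass* entity) {
    return dynamic_cast<TreeClass*>(entity);
}

// Fails loudly at the cast site instead of crashing later in a virtual call.
inline int branch_count(EntityClass& entity) {
    TreeClass* tree = as_tree(&entity);
    assert(tree != nullptr && "entity is not a TreeClass");
    return tree->branches;
}
```

Temporarily swapping the C-style cast in GetTreeClass for a checked cast like this is a cheap way to find out whether m_entity_class really points at the type the cast claims.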
Compilers and linkers are pieces of software written by humans like any other, and thus inherently cannot be error-free.
We occasionally run into such inexplicable issues and fixes too. There's a myth going around here that deleting the ncb file once fixed a build.
Given that recompiling originally fixed the problem, try doing a full clean and rebuild first.
If that fails, then it looks extremely likely that even though your this pointer appears correct to you, the object has in fact been deleted/destroyed and the pointer refers to garbage memory that just happens to look like the real object that was there before. If you're using gdb to debug, the first word at the object's address will be the vtable pointer. If you do an x/16xw <addr> (for example) memory dump at that location, gdb will tell you what sort of object's vtable resides there. If it's the parent-most type then the object is definitely gone.
Alternatively, if the this pointer is the same every time, you can put a breakpoint in the class destructor with the condition that this == known_addr.
We have been debugging a strange case for some days now, and have somewhat isolated the bug, but it still doesn't make any sense. Perhaps anyone here can give me a clue about what is going on.
The problem is an access violation that occurs in one part of the code.
Basically we have something like this:
void aclass::somefunc() {
try {
erroneous_member_function(*someptr);
}
catch (AnException) {
}
}
void aclass::erroneous_member_function(const SomeObject& ref) {
// { //<--scope here error goes away
LargeObject obj = Singleton()->Object.someLargeObj; //<-remove this error goes away
//DummyDestruct dummy1//<-- this is not destroyed before the unreachable
throw AnException();
// } //<--end scope here error goes away
UnreachableClass unreachable; //<- remove this, and the error goes away
DummyDestruct dummy2; //<- destructor of this object is called!
}
In the debugger it actually looks like it is destructing the UnreachableClass, and when I insert the DummyDestruct object, it does not get destroyed before the strange destructor is called. So it does not seem like the destruction of the LargeObject is going awry.
All this is in the middle of production code, and it is very hard to isolate it to a small example.
My question is, does anyone have a clue about what is causing this, and what is happening? I have a quite full featured debugger available (Embarcadero RAD studio), but now I am not sure what to do with it.
Can anyone give me some advice on how to proceed?
Update:
I placed a DummyDestruct object beneath the throw clause, and placed a breakpoint in the destructor. The destructor for this object is entered (and its only use is in this piece of code).
With the information you have provided, and if everything is as you state, the only possible answer is a bug in the compiler/optimizer. Just add the extra scope with a comment (This is, again, if everything is exactly as you have stated).
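For comparison, this is what a conforming compiler should do with code shaped like the question's. The sketch uses a made-up Tracer type and std::runtime_error in place of AnException: only objects constructed before the throw are destroyed during unwinding, and an object declared after the throw is never constructed, so its destructor must not run.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper that records its construction and destruction in a log.
struct Tracer {
    Tracer(std::vector<std::string>& log, std::string name)
        : log_(log), name_(std::move(name)) { log_.push_back("ctor " + name_); }
    ~Tracer() { log_.push_back("dtor " + name_); }
    Tracer(const Tracer&) = delete;
    Tracer& operator=(const Tracer&) = delete;
private:
    std::vector<std::string>& log_;
    std::string name_;
};

inline void throwing_function(std::vector<std::string>& log) {
    Tracer before(log, "before");
    throw std::runtime_error("stand-in for AnException");
    Tracer after(log, "after");   // unreachable: never constructed, never destroyed
}

inline std::vector<std::string> run_unwinding_demo() {
    std::vector<std::string> log;
    try {
        throwing_function(log);
    } catch (const std::runtime_error&) {
        log.push_back("caught");
    }
    for (const std::string& entry : log)
        assert(entry.find("after") == std::string::npos);
    return log;
}
```

If the same experiment in the real code base logs a destructor for the object declared after the throw, that supports the compiler-bug theory; if it does not, the debugger is probably misreporting where it is.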
Stuff like this sometimes happens due to writing through uninitialized pointers, out of bounds array access, etc. The point at which the error is caused may be quite removed from the place where it manifests. However, based on the symptoms you describe it seems to be localized in this function. Could the copy constructor of LargeObject be misbehaving? Is ref being used? Perhaps someptr isn't pointing to a valid SomeObject. Is Singleton() returning a pointer to a valid object?
Compiler error is also a possibility, especially with aggressive optimization turned on. I would try to recreate the bug with no optimizations.
Time to practice my telepathic debugging skills:
My best guess is your application has a stack corruption bug. This can write junk over the call stack, which means the debugger is incorrectly reporting the function when you break, and it's not really in the destructor. Either that or you are incorrectly interpreting the debugger's information and the object really is being destructed correctly, but you don't know why!
If stack corruption is the case you're going to have a really tough time working out what the root cause is. This is why it's important to implement tonnes of diagnostics (eg. asserts) throughout your program so you can catch the stack corruption when it happens, rather than getting stuck on its weird side effects.
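As one possible shape for such diagnostics, here is a sketch of a buffer bracketed by marker values ("canaries") that are asserted on every access. GuardedBuffer and the marker value are invented for illustration; this narrows down when nearby memory was overwritten, but it cannot catch every kind of corruption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper: a fixed-size buffer with a known marker on each side.
template <std::size_t N>
struct GuardedBuffer {
    static constexpr std::uint32_t kCanary = 0xDEADC0DEu;

    std::uint32_t front = kCanary;
    std::array<unsigned char, N> data{};
    std::uint32_t back = kCanary;

    // False once either marker has been overwritten.
    bool intact() const { return front == kCanary && back == kCanary; }

    // Bounds-checked write that also verifies the markers first.
    void write(std::size_t index, unsigned char value) {
        assert(index < N && "write outside GuardedBuffer");
        assert(intact() && "canary overwritten before this write");
        data[index] = value;
    }
};
```

Checking intact() at function entry and exit turns a corruption that would surface much later into an assertion close to the offending write.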
This might be a real long shot but I'm going to put it out there anyway...
You say you use borland - what version? And you say you see the error in a string - STL? Do you include winsock2 at all in your project?
The reason I ask is that I had a problem when using borland 6 (2002) and winsock - the header seemed to mess up the structure packing and meant different translation units had a different idea of the memory layout of std::string, depending on what headers were included by the translation unit, with predictably disastrous results.
Here's another wild guess, since you mentioned strings. I know of at least one implementation where (STL) string copying is done in a lazy manner (i.e., no actual copying of the string contents takes place until a change is made; the "copying" is done by simply having the target string object point to the same buffer as the source). In that particular implementation (GNU) there is a bug whereby excessive copying causes the reference counter (how many objects are using the same actual string memory after supposedly copying it) to roll over to 0, resulting in all sorts of mischief. I haven't encountered this bug myself, but have been told about it by someone who has. (I say this because one would think that the ref counter would be a 32 bit number and the chances of that ever rolling over are pretty slim, to say the least, so I may not be describing the problem properly.)