Debugging a Corrupted Object on the Heap - C++

I'm debugging a non-trivial software project where I have a bunch of objects located on the heap. At some point in time (at least) one of these objects gets corrupted.
I added a const member to my class to serve as a canary and indeed, it gets corrupted during execution. Typically I'd add a watchpoint to this variable to figure out when the memory is written to. However, I don't know which instance gets overwritten, as any information stored in the class gets corrupted as well.
I have too many objects to set a watchpoint on each of them and I haven't been able to reproduce the problem with a smaller input set. Running Valgrind, I see "Invalid read of size 4", which is my 4-byte canary int being read, but at that point it's already too late.
Any suggestions on how to proceed from here?

This probably won't be specific enough, but when I had a similar problem, here is what I ended up doing. I'm assuming you can reproduce your problem in a deterministic fashion.
My strategy was to find which instance caused the problem first. I did this with a counter on a specific line that exposes the symptom. For example, in Visual Studio, I would set up a breakpoint that triggers on the 100000th hit, so that it never actually fires; Visual Studio still tells you how many times the breakpoint was encountered during execution. By trial and error, I would find that the problem occurs on, say, the 20th time the breakpoint is encountered, and so I would set the breakpoint to trigger on the 19th iteration, to be able to single out the appropriate instance before corruption occurred.
From there, I could get the address of the variable that was about to be corrupted before it actually was, and play with the debugger to find out what was going on: gather enough information about the faulty instance.
Then I set up breakpoints at strategic places, triggered by conditions: e.g. only for an instance with the appropriate address, or with specific values in its members.
You'll probably pin down precisely when the symptom occurs, not necessarily the root cause, but that's still something.
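One way to make the "single out the appropriate instance" step concrete in code, if it helps: a sketch (with hypothetical names, not the asker's class) that gives every instance a serial number, so a conditional breakpoint such as m_serial == 20 can stop exactly at the instance created on the 20th construction.

class Tracked {
public:
    Tracked() : m_serial(++s_counter) {}
    int serial() const { return m_serial; }
private:
    static inline int s_counter = 0;  // inline statics need C++17; otherwise define the counter in a .cpp file
    int m_serial;
};

int main() {
    Tracked a, b, c;
    return c.serial();  // 3: a conditional breakpoint on m_serial == 3 singles out this instance
}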
Hope this helps!

Running valgrind I see "Invalid read of size 4", which is my canary int of 4 bytes being read but at this point it's already too late.
You are confused: if Valgrind told you that you are doing an invalid read (presumably because the object has been freed), then you are reading a dangling (already freed) object, and that is exactly your problem.
You shouldn't try to access such objects, and the fact that your canary has been changed / corrupted after you freed / deleted the object is irrelevant.
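For reference, a minimal example (hypothetical code, not the asker's) of the kind of dangling read Valgrind reports as "Invalid read of size 4":

struct Widget { int canary; };   // stands in for the asker's class

int main() {
    Widget* w = new Widget{42};
    delete w;
    return w->canary;            // use after free: Valgrind flags "Invalid read of size 4" here
}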

I managed to find out what was causing my issue. It turns out the object I was looking at never existed in the first place. Like #employed-russian, I wondered whether my object might have been deleted somewhere I wasn't aware of. Putting a breakpoint on the destructor yielded nothing, so the only reasonable explanation was the pointer itself being invalid, pointing to memory that wasn't a valid instance of my class.
Lo and behold; the pointer I was dereferencing was left uninitialized by some constructor of another class. I figured it out when I added an explicit check for null and Valgrind's error became Conditional jump or move depends on uninitialised value(s). By using --track-origins=yes, I quickly figured out the source of the uninitialized data, i.e. the pointer missing from the initialization list.
(I know uninitialized values can be detected by the compiler with -Wuninitialized, but apparently my version of Clang (Apple's) didn't feel like mentioning it even with -Wall enabled.)
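A stripped-down reconstruction of that kind of bug (hypothetical names; the real classes were of course larger):

struct CPlayer { int canary = 0x1234; };   // hypothetical stand-in

struct CLevel {
    CLevel() {}            // bug: m_pPlayer is missing from the initializer list
    CPlayer* m_pPlayer;    // left uninitialized, points to garbage
};

int main() {
    CLevel level;
    if (level.m_pPlayer != 0)             // Valgrind: "Conditional jump or move depends
        return level.m_pPlayer->canary;   // on uninitialised value(s)"
    return 0;
}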

Related

Segmentation fault on one Linux machine but not another with C++ code

I have been having a peculiar problem. I have developed a C++ program on a Linux cluster at work. I have tried to use it at home on an Ubuntu 14.04 machine, and the program, which is composed of six files: main.hpp, main.cpp (depending on) sarsa.hpp, sarsa.cpp (class Sarsa) (depending on) wec.hpp, wec.cpp, does compile, but when I run it, it either returns a segmentation fault or does not enter one fundamental function of the class Sarsa.
The main code calls the constructor and setter functions without problems:
Sarsa run;
run.setVectorSize(memory,3,tilings,1000);
etc.
However, it cannot run the public function episode, since learningRate, which should contain a large integer for each episode, ends up as 0 for all episodes (iterations).
learningRate[episode]=run.episode(numSteps,graph);}
I tried to debug the code with gdb, which has returned:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000408f4a in main () at main.cpp:152
152 learningRate[episode]=run.episode(numSteps,graph);}
I also tried valgrind, which returned:
==10321== Uninitialised value was created by a stack allocation
==10321== at 0x408CAD: main (main.cpp:112)
But no memory leakage issues.
I was wondering if there was a setting to try to debug the external file sarsa.cpp, since I think that class is likely to be the culprit.
In that file I use C++11 features (though I would expect errors at compile time if that were the issue), so I even compiled with g++ -std=c++0x, but there was no improvement.
Unfortunately, because of the size of the code, I cannot post it here. I would really appreciate any help with this problem. Am I missing anything obvious? Could you at least help me with the debugging?
Thank you in advance for the help.
Correction:
main.cpp:
Definition of the global array:
#define numEpisodes 10
int learningRate[numEpisodes];
Towards the end of the main function:
for (int episode; episode<numEpisodes; episode++) {
    if (episode==(numEpisodes-1)) {   // Save the simulation data only at the last episode
        graph=true;
    }
    learningRate[episode]=run.episode(numSteps,graph);
}
As the code you just added to the question reveals, the problem arises because you did not initialize the episode variable. The behavior of any code that uses its value before you assign one is undefined, so it is entirely reasonable that the program behaves differently in one environment than in another.
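With the initialization added, the loop would read:

for (int episode = 0; episode < numEpisodes; episode++) {
    if (episode == (numEpisodes - 1)) {   // save the simulation data only at the last episode
        graph = true;
    }
    learningRate[episode] = run.episode(numSteps, graph);
}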
A segmentation fault indicates an invalid memory access. Usually this means that somewhere, you're reading or writing past the end of an array, or through an invalid pointer, or through an object that has already been freed. You don't necessarily get the segmentation fault at the point where the bug occurs; for instance, you could write past the end of an array onto heap metadata, which causes a crash later on when you try to allocate or release an unrelated object. So it's perfectly reasonable for a program to appear to work on one system but crash on another.
In this case, I'd start by looking at learningRate[episode]. What is the value of episode? Is it within the bounds of learningRate?
I was wondering if there was a setting to try to debug the external file sarsa.cpp, since I think that class is likely to be the culprit.
It's possible to set breakpoints in files other than main.cpp.
break location
Set a breakpoint at the given location, which can specify a function name, a line number, or an address of an instruction.
At least, I think that's your question. You'll also need to know how to step into functions.
More importantly, you need to learn what your tools are trying to tell you. A segfault is the operating system's reaction to an attempt to dereference memory that doesn't belong to you. One common reason for that is trying to dereference NULL. Another would be trying to dereference a pointer that was never initialized. The Valgrind error message suggests that you may have an uninitialized pointer.
Without the code, I can't tell you why the pointer isn't initialized when you run the program on your home system but is (apparently) initialized when you run it at work. I suspect that you don't have the necessary data on your home system, but you'll need to investigate and figure that out. The fundamental question to keep asking yourself is "what is different between my home computer and my work computer?"

General way of solving Error: Stack around the variable 'x' was corrupted

I have a program which gives me the following error in VS2010, in a debug build:
Error: Stack around the variable 'x' was corrupted
This gives me the function where a stack overflow likely occurs, but I can't visually see where the problem is.
Is there a general way to debug this error with VS2010? Would it be possible to identify which write operation is overwriting the incorrect stack memory?
thanks
Is there a general way to debug this error with VS2010?
No, there isn't. What you have done is to somehow invoke undefined behavior. The reason these behaviors are undefined is that the general case is very hard to detect/diagnose. Sometimes it is provably impossible to do so.
There are, however, a fairly small number of things that typically cause this problem:
Improper handling of memory:
Deleting something twice,
Using the wrong type of deletion (free for something allocated with new, etc.),
Accessing something after its memory has been deleted.
Returning a pointer or reference to a local.
Reading or writing past the end of an array.
This can be caused by several issues that are generally hard to see:
double deletes
delete a variable allocated with new[] or delete[] a variable allocated with new
delete something allocated with malloc
delete an automatic storage variable
returning a local by reference
If it's not immediately clear, I'd get my hands on a memory debugger (Rational Purify comes to mind for Windows).
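For illustration only, here are hypothetical snippets (not from the question) showing two of the mistakes listed above:

int* f() {
    int local = 42;
    return &local;    // returning the address of a local: it dangles once f() returns
}

int main() {
    int* p = new int[4];
    delete p;         // mismatched: memory from new[] must be released with delete[]
    int* q = f();     // q now points at dead stack memory
    (void)q;
    return 0;
}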
This message can also be due to an array bounds violation. Make sure that your function (and every function it calls, especially member functions for stack-based objects) is obeying the bounds of any arrays that may be used.
Actually, what you see is quite informative: you should check the code near the location of the variable x for any activity that might cause this error.
Below is how you can reproduce such an error:
#include <cstring>   // for memset

int main() {
    char buffer1[10];
    char buffer2[20];
    memset(buffer1, 0, sizeof(buffer1) + 1);   // writes one byte past the end of buffer1
    return 0;
}
will generate (VS2010):
Run-Time Check Failure #2 - Stack around the variable 'buffer1' was corrupted.
Obviously memset has written one char more than it should. VS with the /GS option (which you have enabled) can detect such buffer overruns; for more on that read here: http://msdn.microsoft.com/en-us/library/Aa290051.
You can, for example, use the debugger and step through your code, watching the contents of your variables and how they change with each step. You can also try your luck with data breakpoints: you set a breakpoint on a memory location, and the debugger stops the moment that memory changes, possibly showing you the call stack where the problem is located. But this actually might not work with the /GS flag.
For detecting heap overflows you can use the gflags tool.
I was puzzled by this error for hours. I knew the possible causes, which are already mentioned in the previous answers, but I don't allocate memory, I don't access array elements, and I don't return pointers to local variables...
Then finally found the source of the problem:
*x++;
The intent was to increment the pointed-to value. But because of operator precedence, the ++ binds first, moving the x pointer forward, and the * then does nothing; subsequent writes to *x corrupt the stack canary if the parameter comes from the stack, making VS complain.
Changing it to (*x)++ solves the problem.
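A minimal sketch of the difference (hypothetical code, not the original): postfix ++ binds tighter than the unary *, so *x++ advances the pointer and discards the dereferenced value, while (*x)++ increments the value being pointed to.

#include <cstdio>

int main() {
    int value = 5;
    int* x = &value;
    *x++;                         // parses as *(x++): x now points past value, nothing is incremented
    x = &value;                   // reset the pointer for the second demonstration
    (*x)++;                       // increments the pointed-to value
    std::printf("%d\n", value);   // prints 6
    return 0;
}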
Hope this helps.
Here is what I do in this situation:
Set a breakpoint at a location where you can see the (correct) value of the variable in question, but before the error happens. You will need the memory address of the variable whose stack is being corrupted. Sometimes I have to add a line of code so the debugger gives me the address easily (e.g. int *x = &y).
At this point you can set a memory breakpoint (Debug->New Breakpoint->New Data Breakpoint)
Hit Play and the debugger should stop when the memory is written to. Look up the stack (mine usually breaks in some assembly code) to see what's being called.
I usually watch the variable just before the complaining variable, which often helps me find the problem. But this can sometimes be very complex with no clue, as you have seen. You could enable Debug menu >> Exceptions and tick 'Win32 exceptions' to catch all exceptions. This will still not catch this particular exception, but it could catch something else that indirectly points to the problem.
In my case it was caused by a library I was using. It turned out the header file I was including in my project didn't quite match the actual header file in that library (it was off by one line).
There is a different error which is also related:
0xC015000F: The activation context being deactivated is not the most recently activated one.
When I got tired of getting the mysterious stack corrupted message on my computer with no debugging information, I tried my project on another computer and it was giving me the above message instead. With the new exception I was able to work my way out.
I encountered this when I made a pointer array of 13 items and then tried to set the 14th item. Changing the array to 14 items solved the problem. Hope this helps some people ^_^
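Reconstructed as a snippet (hypothetical, not the poster's exact code), the mistake was essentially:

int main() {
    int* items[13];        // valid indices are 0 through 12
    items[13] = nullptr;   // writes one slot past the end and triggers the stack-corruption check
    return 0;
}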
One relatively common source of the "Stack around the variable 'x' was corrupted" problem is a wrong cast. It is sometimes hard to spot. Here is an example of a function where such a problem occurs, together with the fix. In the function assignValue I want to assign some value to a variable. The variable is located at the memory address passed as an argument to the function:
#include <cstdint>     // uint64_t, int8_t
#include <algorithm>   // std::copy
#include <memory>      // std::addressof
#include <cassert>

using namespace std;

template<typename T>
void assignValue(uint64_t address, T value)
{
    int8_t* begin_object = reinterpret_cast<int8_t*>(std::addressof(value));
    // wrongly cast to (int*), produces the error (sizeof(int) == 4)
    //std::copy(begin_object, begin_object + sizeof(T), (int*)address);
    // correct cast to (int8_t*), assignment byte by byte (sizeof(int8_t) == 1)
    std::copy(begin_object, begin_object + sizeof(T), (int8_t*)address);
}

int main()
{
    int x = 1;
    int x2 = 22;
    assignValue<int>((uint64_t)&x, x2);
    assert(x == x2);
}

My code crashes on delete this

I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, which happens only once in a while. About 95% of the time the code runs perfectly fine, but sometimes this seems to be corrupted somehow and the code crashes.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
    long nRefCount = InterlockedDecrement(&m_nRefCount);
    if (nRefCount == 0)
    {
        delete this;
    }
    return nRefCount;
}
The object has been created with new; it is not in any kind of array.
The most obvious answer is: don't delete this.
If you insist on doing that, then use the common ways of finding bugs:
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)
It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and in the case of an overridden operator new, or multiple copies of the C++ runtime, the particular new that matches the delete found in the current scope).
Crashes upon deallocation can be a pain: they are not supposed to happen, and when they do, the code is usually too complicated to easily find the cause.
Note: the use of InterlockedDecrement has me assuming you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You should try to find code on the internet to print out the current stack for each log entry. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in a debug build, or have the PDB file alongside the executable, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code, but Google is your friend here. The thing to remember is that you can't have a __try/__except handler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "it happens only 5% of the time" symptom could be caused by different code paths being executed, or by the fact that you have multiple threads playing together with the same data.
The InterlockedDecrement part bothers me: is your object living in multiple threads? And is m_nRefCount correctly aligned and a volatile LONG?
The "correctly aligned" and "LONG" parts are important here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on a 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on a 32-bit boundary. Are there #pragma pack() instructions in your code? Does your project file change the default alignment (I assume you're working in Visual Studio)?
For the volatile part, InterlockedDecrement seems to generate a read/write memory barrier, so volatile should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).
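For what it's worth, a sketch of how the counter might be declared so it matches InterlockedDecrement's expectations (this assumes Windows and uses hypothetical names; it is not your actual class):

#include <windows.h>

class CRefCounted {
public:
    CRefCounted() : m_nRefCount(1) {}
    virtual ~CRefCounted() {}
    long AddRef()  { return InterlockedIncrement(&m_nRefCount); }
    long Release()
    {
        long nRefCount = InterlockedDecrement(&m_nRefCount);
        if (nRefCount == 0)
        {
            delete this;   // only safe if every instance was allocated with new
        }
        return nRefCount;
    }
private:
    volatile LONG m_nRefCount;   // a LONG (32 bits), naturally aligned, as the API expects
};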

Debugging a memory error with GDB and C++

I'm running my C++ program in gdb. I'm not very experienced with gdb, but I'm getting messages like:
warning: HEAP[test.exe]:
warning: Heap block at 064EA560 modified at 064EA569 past requested size of 1
How can I track down where this is happening at? Viewing the memory doesn't give me any clues.
Thanks!
So you're busting your heap. Here's a nice GDB tutorial to keep in mind.
My normal practice is to set a break in a known-good part of the code. Once it gets there, step through until you error out. Normally you can determine the problem that way.
Because you're getting a heap error, I'd assume it has to do with something you're putting on the heap, so pay special attention to those variables (I think you can use print in GDB to determine a variable's memory address, and that may help you sync up with where you're erroring out). You should also remember that entering and returning from functions plays with the heap, so that may be where your problem lies (especially if you messed up your heap before returning from a function).
You can probably use a feature called a "watchpoint". This is like a breakpoint, but the debugger stops when the memory is modified.
I gave a rough idea on how to use this in an answer to a different question.
If you can use other tools, I highly recommend trying out Valgrind. It is an instrumentation framework that can run your code in a manner that typically allows it to stop at the exact instruction that causes the error. Heap errors are usually easy to find this way.
One thing you can try is the same sort of check with the standard libc, by configuring the MALLOC_CHECK_ environment variable (man libc).
If you keep from exiting gdb (if your application quits, just use "r" to re-run it), you can set up a memory breakpoint at that address, "hbreak 0x64EA569"; also use "help hbreak" to configure conditions or other breakpoint enable/disable options to avoid entering that breakpoint excessively....
You can also configure a log file (set log ...), set up a stack trace on every break ("display/bt -4"), then hit r and just hold down the enter key and let it scroll by,
or use "c ##" to continue x times, etc.; eventually you will see that same assertion, and you will then have (thanks to the display/bt) a stack trace which you can correlate with the code that was modifying that address...
I had a similar problem when I was trying to realloc an array of pointers to my structures, but was instead reallocating it as an array of ints (because I got the code from a tutorial and forgot to change it). The compiler wasn't correcting me, because it cannot check what is in the size argument.
My variable was:
itemsetList_t ** iteration_isets;
So in the realloc, instead of having:
iteration_isets = realloc(iteration_isets, sizeof(itemsetList_t *) * max_elem);
I had:
iteration_isets = realloc(iteration_isets, sizeof(int) * max_elem);
And this caused my heap problem.

Pointer mysteriously resetting to NULL

I'm working on a game and I'm currently working on the part that handles input. Three classes are involved here, there's the ProjectInstance class which starts the level and stuff, there's a GameController which will handle the input, and a PlayerEntity which will be influenced by the controls as determined by the GameController. Upon starting the level the ProjectInstance creates the GameController, and it will call its EvaluateControls method in the Step method, which is called inside the game loop. The EvaluateControls method looks a bit like this:
void CGameController::EvaluateControls(CInputBindings *pib) {
    // if no player yet
    if (gc_ppePlayer == NULL) {
        // create it
        Handle<CPlayerEntityProperties> hep = memNew(CPlayerEntityProperties);
        gc_ppePlayer = (CPlayerEntity *)hep->SpawnEntity();
        memDelete((CPlayerEntityProperties *)hep);
        ASSERT(gc_ppePlayer != NULL);
        return;
    }

    // handles controls here
}
This function is called correctly and the assert never triggers. However, every time this function is called, gc_ppePlayer is set to NULL. As you can see it's not a local variable going out of scope. The only place gc_ppePlayer can be set to NULL is in the constructor or possibly in the destructor, neither of which are being called in between the calls to EvaluateControls. When debugging, gc_ppePlayer receives a correct and expected value before the return. When I press F10 one more time and the cursor is at the closing brace, the value changes to 0xffffffff. I'm at a loss here, how can this happen? Anyone?
Set a watchpoint on gc_ppePlayer == NULL. When the value of that expression changes (to NULL or from NULL), the debugger will point you to exactly where it happened.
Try that and see what happens. Look for unterminated strings or a memcpy copying into memory that is too small, etc.; usually that is the cause of global/stack variables being overwritten randomly.
To add a watchpoint in VS2005 (instructions by brone):
Go to the Breakpoints window.
Click New, then click Data breakpoint.
Enter &gc_ppePlayer in the Address box, leave the other values alone.
Then run.
When gc_ppePlayer changes, the breakpoint will be hit. – brone
Are you debugging a Release or Debug configuration? In release build configuration, what you see in the debugger isn't always true. Optimisations are made, and this can make the watch window show quirky values like you are seeing.
Are you actually seeing the ASSERT triggering? ASSERTs are normally compiled out of Release builds, so I'm guessing you are debugging a release build which is why the ASSERT isn't causing the application to terminate.
I would recommend building a Debug version of the software and then seeing if gc_ppePlayer is really NULL. If it really is, maybe you are seeing heap corruption of some sort where this pointer is being overwritten. But if it were memory corruption, it would generally be much less deterministic than you are describing.
As an aside, using global pointer values like this is generally considered bad practice. See if you can replace this with a singleton class if it is truly a single object and needs to be globally accessible.
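As a sketch of that suggestion (hypothetical names, assuming the player pointer really is a single global):

class CPlayerEntity;   // forward declaration, defined elsewhere in the game code

class PlayerHolder {
public:
    static PlayerHolder& Instance() {
        static PlayerHolder instance;   // constructed on first use, destroyed at exit
        return instance;
    }
    CPlayerEntity* pPlayer = nullptr;
private:
    PlayerHolder() = default;
    PlayerHolder(const PlayerHolder&) = delete;
    PlayerHolder& operator=(const PlayerHolder&) = delete;
};

// usage: PlayerHolder::Instance().pPlayer instead of the gc_ppePlayer global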
My first thought is to say that SpawnEntity() is returning a pointer to an internal member that is getting "cleared" when memDelete() is called. It's not clear to me when the pointer is set to 0xffffffff, but if it occurs during the call to memDelete(), then this explains why your ASSERT is not firing - 0xffffffff is not the same as NULL.
How long has it been since you've rebuilt the entire code base? I've seen memory problems like this every now and again that are cleared up by simply rebuilding the entire solution.
Have you tried doing a step into (F11) instead of the step over (F10) at the end of the function? Although your example doesn't show any local variables, perhaps you left some out for the sake of simplicity. If so, F11 will (hopefully) step into the destructors for any of those variables, allowing you to see if one of them is causing the problem.
You have a "fandango on core."
The dynamic initialization is overwriting assorted bits (sic) of memory.
Either directly, or indirectly, the global is being overwritten.
Where is the global located in memory relative to the heap?
Binary chop the dynamically initialized portion until the problem goes away (comment out half at a time, recursively).
Depending on what platform you are on there are tools (free or paid) that can quickly figure out this sort of memory issue.
Off the top of my head:
Valgrind
Rational Purify