Segmentation fault on one Linux machine but not another with C++ code - c++

I have been having a peculiar problem. I have developed a C++ program on a Linux cluster at work. I have tried to use it home on an Ubuntu 14.04 machine, but the program, which is composed of 6 files: main.hpp,main.cpp (dependent on) sarsa.hpp,sarsa.cpp (class Sarsa) (dependent on) wec.hpp,wec.cpp, does compile, but when I run it it either returns segmenation fault or does not enter one fundamental function of the class Sarsa.
The main code calls the constructor and setter functions without problems:
Sarsa run;
run.setVectorSize(memory,3,tilings,1000);
etc.
However, it cannot run the public function episode , since learningRate, which should contain a large integer, returns 0 for all episodes (iterations).
learningRate[episode]=run.episode(numSteps,graph);}
I tried to debug the code with gdb, which has returned:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000408f4a in main () at main.cpp:152
152 learningRate[episode]=run.episode(numSteps,graph);}
I also tried valgrind, which returned:
==10321== Uninitialised value was created by a stack allocation
==10321== at 0x408CAD: main (main.cpp:112)
But no memory leakage issues.
I was wondering if there was a setting to try to debug the external file sarsa.cpp, since I think that class is likely to be the culpript
In the file, I use C++v11 language (I would be expecting errors at compile-time,though), so I even compiled with g++ -std=c++0x, but there were no improvements.
Unluckily, because of the size of the code, I cannot post it here. I would really appreciate any help with this problem. Am I missing anything obvious? Could you help me at least with the debugging?
Thank you in advance for the help.
Correction:
main.cpp:
Definition of the global array:
`#define numEpisodes 10
int learningRate[numEpisodes];`
Towards the end of the main function:
for (int episode; episode<numEpisodes; episode++) {
if (episode==(numEpisodes-1)) { // Save the simulation data only at the
graph=true;} // last episode
learningRate[episode]=run.episode(numSteps,graph);}

As the code you just added to the question reveals, the problem arises because you did not initialize the episode variable. The behavior of any code that uses its value before you assign one is undefined, so it is entirely reasonable that the program behaves differently in one environment than in another.

A segmentation fault indicates an invalid memory access. Usually this means that somewhere, you're reading or writing past the end of an array, or through an invalid pointer, or through an object that has already been freed. You don't necessarily get the segmentation fault at the point where the bug occurs; for instance, you could write past the end of an array onto heap metadata, which causes a crash later on when you try to allocate or release an unrelated object. So it's perfectly reasonable for a program to appear to work on one system but crash on another.
In this case, I'd start by looking at learningRate[episode]. What is the value of episode? Is it within the bounds of learningRate?

I was wondering if there was a setting to try to debug the external file sarsa.cpp, since I think that class is likely to be the culpript
It's possible to set breakpoints in functions other than main.cpp.
break location
Set a breakpoint at the given location, which can specify a function name, a line number, or an address of an instruction.
At least, I think that's your question. You'll also need to know how to step into functions.
More importantly, you need to learn what your tools are trying to tell you. A segfault is the operating system's reaction to an attempt to dereference memory that doesn't belong to you. One common reason for that is trying to dereference NULL. Another would be trying to dereference a pointer that was never initialized. The Valgrind error message suggests that you may have an unitialized pointer.
Without the code, I can't tell you why the pointer isn't initialized when you run the program on your home system, but is (apparently) initialized when you run it at work. I suspect that you don't have the necessary data on your home system, but you'll need to investigate and figure that out. The fundamental question to keep asking yourself is "what is different between my home computer an dmy work computer?"

Related

Debugging a Corrupted Object on the Heap

I'm debugging a non-trivial software project where I have a bunch of objects located on the heap. At some point in time (at least) one of these objects gets corrupted.
I added a const member to my class to serve as a canary and indeed, it gets corrupted during executing. Typically I'd add a watchpoint to this variable to figure out when the memory is written to. However, I don't know which instance gets overwritten, as any information stored in the class gets corrupted as well.
I have too many objects to set a watchpoint on each of them and I haven't been able to reproduce with a smaller input set. Running valgrind I see "Invalid read of size 4", which is my canary int of 4 bytes being read but at this point it's already too late.
Any suggestions on how to proceed from here?
Probably this won't be specific enough, but when I had a similar problem, here is what I ended up doing. I'm assuming you can reproduce your problem in a deterministic fashion.
My strategy was to find which instance caused the problem first. This I did with a counter on a specific line that exposes the symptom. For example, on Visual Studio, I would setup a breakpoint that triggers on the 100000th hit, so that it never does; but Visual Studio still tells you how many times the breakpoint is encountered during execution. By trial and error, I would find that the problem occurs on the say, 20th time the breakpoint is encountered, and so I would set the breakpoint to trigger on the 19th iteration, to be able to discriminate the appropriate instance before corruption occurred.
Starting from there, I could have the address of the variable that was corrupted before it was, and play with the debugger to find out what is going on: gather enough information about the faulty instance.
Then, I did setup breakpoints at strategic places, which were triggered by conditions : eg. trigger only for an instance with the appropriate address, or with specific values in members.
You'll probably get to when the symptom occurs precisely, but not to the problem, but that's still something.
Hope this helps!
Running valgrind I see "Invalid read of size 4", which is my canary int of 4 bytes being read but at this point it's already too late.
You are confused: if valgrind told you that you are doing invalid read (presumably because the object has been freed), then you are reading danging (already freed) object, and that is exactly your problem.
You shouldn't try to access such objects, and the fact that your canary has been changed / corrupted after you freed / deleted the object is irrelevant.
I managed to find out what was causing my issue. Turns out the object I was looking at never existed in the first place. Like #employed-russian, I wondered whether my object might have been deleted somewhere I wasn't aware of. Putting a breakpoint on the destructor yielded nothing so the only reasonable explanation is the pointer itself being invalid, pointing to memory that wasn't a valid instance of my class.
Lo and behold; the pointer I was dereferencing was left uninitialized by some constructor of another class. I figured it out when I added an explicit check for null and Valgrind's error became Conditional jump or move depends on uninitialised value(s). By using --track-origins=yes, I quickly figured out the source of the uninitialized data, i.e. the pointer missing from the initialization list.
(I know uninitialized values can be detected by the compiler with -Wuninitialized but apparently my version of clang (apple) didn't feel like mentioning it with -Wall enabled.)

C++ - Debug Assertion Failed on program exit

When calling 'exit', or letting my program end itself, it causes a debug assertion:
.
Pressing 'retry' doesn't help me find the source of the problem. I know that it's most likely caused by memory being freed twice somewhere, the problem is that I have no clue where.
The whole program consists of several hundred thousand lines which makes it fairly difficult to take a guess on what exactly is causing the error. Is there a way to accurately tell where the source of the problem is located without having to comb line for line through the code?
The callstack doesn't really help either:
.
this kind of errors usally appear if you delete already deleted objects.
this happens if one object is given to multiple other objects that are supposed to take ownership of that first object and both try to delete it in their destructor.
As the messagebox already suggests, you're probably corrupting your heap in some way. Either you free/delete some memory block that you should not or you're trying to write to some memory block that has already been freed/deleted.
The callstack suggests that this is probably happening when stepping over the last line of your main function. If that is the case, then the problem is probably in the cleanup routines of some user defined types of which you create instances inside the main function. Try setting breakpoints inside the destructors of your own classes and investigate.
It is possible that you are corrupting the heap during the program operation, but it is not being detected until the end of the program, in which case the stack trace would only point to the memory checking routine
There may be a function you can call during operation that checks if the heap is valid, which may bring the fail closer to the point of corruption
HeapValidate is an example of such a routine, but it would depend on what platform you are using
This error can also happen when you use delete[] instead of delete. However, as mentioned, this is only one of many causes.

My code crashes on delete this

I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, only once in a while. About 95% of the time the code runs perfectly fine but sometimes this seems to be corrupted somehow and crash.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
long nRefCount = InterlockedDecrement(&m_nRefCount);
if(nRefCount == 0)
{
delete this;
}
return nRefCount;
}
The object has been created with a new, it is not in any kind of array.
The most obvious answer is : don't delete this.
If you insists on doing that, then use common ways of finding bugs :
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)
It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and in case of overridden operator new, or multiple copies of the C++ runtime, the particular new that matches delete found in the current scope)
Crashes upon deallocation can be a pain: It is not supposed to happen, and when it happens, the code is too complicated to easily find a solution.
Note: The use of InterlockedDecrement have me assume you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You should try to load code from internet to print out the current stack for each log. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in debug build, or have the PDB file along the executable file, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code, but here, Google is your friend, but the thing to remember is that you can't have a __try/__except handdler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "I happens only 5% of the time" symptom could be caused by different code path executions, or the fact you have multiple threads playing together with the same data.
The InterlockedDecrement part bothers me: Is your object living in multiple threads? And is m_nRefCount correctly aligned and volatile LONG?
The correctly aligned and LONG part are important, here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on a 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on 32-byte boundaries. Is there #pragma pack() instructions in your code? Does your projet file change the default alignment (I assume you're working on Visual Studio)?
For the volatile part, InterlockedDecrement seem to generate a Read/Write memory barrier, so the volatile part should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).

Segmentation fault on valid memory

I get segmentation fault accessing an object which looks valid and fully accessible in gdb. Isn't segmentation is always about inaccessible memory?
EDIT: more details.
The crash happend under gdb so I could examine the object's memory. It had the members set to proper values so there is no chance I was accessing read-only memory. The instruction where crashed happed is kind of Var = Obj.GetMember() where Var, GetMember and the corresponding member are short integers.
Misalignment? I suppose it would cause bus error, not segmentation. I'll try to rebuild all. The problem is that this piece of code runs thousands times a second and the segmentation happens once in several days.
Try complete rebuild (make clean && make), this had helped me a couple of times when I encountered such weird errors.
Late UPD:
If this does fix the problem, it usually means that something is wrong with your makefile, usually screwed-up dependencies between .cpp and .h files, for example: a.cpp includes b.h, but b.h is not listed in a.cpp's dependencies.
You can get faults even if accessing "valid" memory under some circumstances:
you're attempting to modify memory but the specific mapping is readonly
you're attempting to execute code in a memory area that is no-execute
you're attempting to e.g. load/store at a misaligned address and your hardware issues alignment exceptions
Without a look at the coredump, to figure out what the faulting instruction (load/store/execute) was and what exactly the mapping permissions for the accessed memory were it's impossible to distinguish.
Basically, yes. Did you use the core dump to analyze your seg fault?
Code would very much help, but have you done a make clean? If you've increased the size of a class and your dependencies aren't right then there won't be enough space allocated for an instance and that class will then overrun and corrupt whatever it precedes in memory.

Debugging a memory error with GDB and C++

I'm running my C++ program in gdb. I'm not real experienced with gdb, but I'm getting messages like:
warning: HEAP[test.exe]:
warning: Heap block at 064EA560 modified at 064EA569 past requested size of 1
How can I track down where this is happening at? Viewing the memory doesn't give me any clues.
Thanks!
So you're busting your heap. Here's a nice GDB tutorial to keep in mind.
My normal practice is to set a break in known good part of the code. Once it gets there step through until you error out. Normally you can determine the problem that way.
Because you're getting a heap error I'd assume it has to do with something you're putting on the heap so pay special attention to variables (I think you can use print in GDB to determine it's memory address and that may be able to sync you with where your erroring out). You should also remember that entering functions and returning from functions play with the heap so they may be where your problem lies (especially if you messed your heap before returning from a function).
You can probably use a feature called a "watch point". This is like a break point but the debugger stops when the memory is modified.
I gave a rough idea on how to use this in an answer to a different question.
If you can use other tools, I highly recommend trying out Valgrind. It is an instrumentation framework, that can run your code in a manner that allows it to, typically, stop at the exact instruction that causes the error. Heap errors are usually easy to find, this way.
One thing you can try, as this is the same sort of thing as the standard libc, with the MALLOC_CHECK_ envronment variable configured (man libc).
If you keep from exiting gdb (if your application quit's, just use "r" to re-run it), you can setup a memory breakpoint at that address, "hbreak 0x64EA569", also use "help hbreak" to configure condition's or other breakpoitn enable/disable options to prevent excessively entering that breakpoint....
You can just configure a log file, set log ... setup a stack trace on every break, "display/bt -4", then hit r, and just hold down the enter key and let it scroll by
"or use c ## to continue x times... etc..", eventually you will see that same assertion, then you will now have (due to the display/bt) a stacktrace which you can corolate to what code was modifying on that address...
I had similar problem when I was trying to realloc array of pointers to my structures, but instead I was reallocating as array of ints (because I got the code from tutorial and forgot to change it). The compiler wasnt correcting me because it cannot be checked whats in size argument.
My variable was:
itemsetList_t ** iteration_isets;
So in realloc instead of having:
iteration_isets = realloc(iteration_isets, sizeof(itemsetList_t *) * max_elem);
I had:
iteration_isets = realloc(iteration_isets, sizeof(int) * max_elem);
And this caused my heap problem.