I've come across a strange issue in some code that I'm working on. Basically what's going on is that whenever I try to get some information from an empty map, the program segfaults. Here's the relevant code:
(note that struct Pair is a data structure that is defined earlier, and sendMasks is a std::map that is good)
std::map<std::string*, struct Pair*>::iterator it;
for(it = sendMasks->begin(); it != sendMasks->end(); it++){ //segfault
//(some code goes here)
}
I know that the pointer to the map is good; I can do
it = sendMasks->begin();
it = sendMasks->end();
before my loop, and it doesn't segfault at all then.
Now, if I put the following test before the for loop, it will segfault:
if( sendMasks->empty() )
As will any other attempt to determine if the map is empty.
This issue will only occur if the map is empty. My only thought on this issue would be that because I am updating sendMasks in a separate thread, that it may not have been updated properly; that however doesn't make any sense because this will only happen if the map is empty, and this code has worked perfectly fine before now. Any other thoughts on what could be happening?
EDIT:
I figured out what the problem was.
At an earlier part in my code, I was making a new char* array and putting that pointer into another array of length 4. I was then putting a NULL character at the end of my new array, but accidentally only did a subscript off of the first array - which went off the end of the array and overwrote a pointer. Somehow, this managed to work properly occasionally. (valgrind doesn't detect this problem)
The sequence was something like this:
object* = NULL; //(overwritten memory)
object->method();
//Inside object::method() :
map->size(); //segfault. Gets an offset of 0x24 into the object,
//which is NULL to begin with. memory location 0x24 = invalid
I wasn't expecting the instance of the object itself to be null, because in Java this method call would fail before it even did that, and in C this would be done quite differently(I don't do much object-oriented programming in C++)
If you are accessing a data structure from different threads, you must have some kind of synchronization. You should ensure that your object is not accessed simultaneously from different threads. As well, you should ensure that the changes done by one of the threads are fully visible to other threads.
A mutex (or critical section if on Windows) should do the trick: the structure should be locked for each access. This ensures the exclusive access to the data structure and makes the needed memory barriers for you.
Welcome to the multithreaded world!
Either:
You made a mistake somewhere, and have corrupted your memory. Run your application through valgrind to find out where.
You are not using locks around access to objects that you share between threads. You absolutely must do this.
I know that the pointer to the map is good; I can do
it = sendMasks->begin();
it = sendMasks->end();
before my loop, and it doesn't segfault at all then.
This logic is flawed.
Segmentation faults aren't some consistent, reliable indicator of an error. They are just one possible symptom of a completely unpredictable system, that comes into being when you have invoked Undefined Behaviour.
this code has worked perfectly fine before now
The same applies here. It may have been silently "working" for years, quietly overwriting bytes in memory that it may or may not have had safe access to.
This issue will only occur if the map is empty.
You just got lucky that, when the map is empty, your bug is evident. Pure chance.
Related
for 2 days now, I have been trying to find out where is the problem in my code. I have isolated the problem like this:
There is loop which look like this:
int test_counter = 0; //Debug purpose only
for (const_iterator i = begin(); i != end(); i++, test_counter++){
if ((*i)->isSoloed()) {
soloed = (*i);
break;
}
}
It is in one method of the class that inherits std::list. The list contains pointers to some dynamically allocated instances of some class, but that is likely not important here.
The list contains exactly two pointers.
The problem is that in about 20% runs, the second pass (test_counter == 1) crashes on (*i)->isSoloed() with access violation. In this case, the iterator value is 0xfeeefeee. This exact value is used by VisualStudio to indicate that the memory has been freed. Well that doesn't make any sense from at least 3 reasons:
No memory gets dealocated here or in another threads
Even if so, how would the iterator get that value???
If in the case of
crash (the exception window) I click break and look at the second
items in the list looks intact and everything seems OK.
Note that this is a multithreaded code which is likely to be the problem here, but the loop is read-only (I even used the const_iterator) and the other thread that has the pointer to this list does not write in the time when the loop is running. But even so, how could that affect the value of the iterator which is a local variable here!
Thanks a lot.
//edit:
I have also noticed 2 more interesting things:
1) if I break the debugging after the access violation occurs, I can go back (by dragging the next commant to execute arrow) before the loop and run it again without any problem. So the problem is unfortunatelly pretty undeterministic.
2) I have never been able to reproduce the problem in release build.
The signature of the method:
MidiMessageSequence::MidiEventHolder* getNextActiveEvent();
and it is called like this:
currentEvent = workingTrackList->getNextActiveEvent();
nothing special really. The application uses JUCE library, but that shouldn't be a problem. I can post more code, just tell me what should I post.
Two possible reasons.
1: The memory that i pointed to is deleted before it is accessed (*i). Try to add a check if(i) before access i (*i)->isSoloed().
2: Try to add a lock before you access the list or list item each time.
I haven't found out where exactly was the problem but I canceled the inheritance and agregated the std::list instead. With this the TrackList class became just a sort of wrapper around the std::list. I put a scoped lock in every method that accesses the list (in the wrapper, so from outside it works the same and I do not need to care about locking from outside) and this pretty much solved the problem.
Shared memory is giving me a hard time and GDB isn't being much help. I've got 32KB of shared memory allocated, and I used shmat to cast it to a pointer to a struct containing A) a bool and B) a queue of objects containing one std::string, three ints, and one bool, plus assorted methods. (I don't know if this matryoshka structure is how you're supposed to do it, but it's the only way I know. Using a message queue isn't an option, and I need to use multiple processes.)
Pushing one object onto the queue works, but when I try to push a second, the program freezes. No error message, no nothing. What's causing this? I doubt it's a lack of memory, but if it is, how much do I need?
EDIT: In case I was unclear -- the objects in the queue are of a class with the five data members described.
EDIT 2: I changed the class of the queue's entries so that it doesn't use std::string. (Embarrassingly enough, I was able to represent the data with a primitive.) The program still freezes on the second push().
EDIT 3: I tried calling front() from the same queue immediately after the first push(), and it froze the program too. Checking the value of the bool outside the queue, however, worked fine, so it's gotta be something wrong with the queue itself.
EDIT 4: As an experiment, I added an std::queue<int> to the struct I was using for the shared memory. It showed the same behavior -- push() worked once, then front() made it freeze. So it's not a problem with the class I'm using for the queue items, either.
This question suggests I'm not likely to solve this with std::queue. Is that so? Should I use boost like it says? (In my case, I'm executing shmget() and shmat() in the parent process and trying to let two child processes communicate, so it's slightly different.)
EDIT 5: The other child process also freezes when it calls front(). A semaphore ensures this happens after the first push() call.
Putting std::string objects into a shared memory segment can't possibly work.
It should work fine for a single process, but as soon as you try to access it from a second process, you'll get garbage: the string will contain a pointer to heap-allocated data, and that pointer is only valid in the process that allocated it.
I don't know why your program freezes, but it is completely pointless to even think about.
As I said in my comment, your problem stems from attempting to use objects that internally require heap allocation in a structure, which should be self contained (i.e. requires no further dynamically allocated memory).
I would tweak your setup, and change the std::string to some fixed size character array, something like
// this structure fits nicely into a typical cache line
struct Message
{
boost::array<char, 48> some_string;
int a, b, c;
bool c;
};
Now, when you need to post something on the queue, copy the string content into some_string. Of course you should size your strings appropriately (and boost::array probably isn't the best - ideally you want some length information too) but you get the idea...
I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, only once in a while. About 95% of the time the code runs perfectly fine but sometimes this seems to be corrupted somehow and crash.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
long nRefCount = InterlockedDecrement(&m_nRefCount);
if(nRefCount == 0)
{
delete this;
}
return nRefCount;
}
The object has been created with a new, it is not in any kind of array.
The most obvious answer is : don't delete this.
If you insists on doing that, then use common ways of finding bugs :
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)
It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and in case of overridden operator new, or multiple copies of the C++ runtime, the particular new that matches delete found in the current scope)
Crashes upon deallocation can be a pain: It is not supposed to happen, and when it happens, the code is too complicated to easily find a solution.
Note: The use of InterlockedDecrement have me assume you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You should try to load code from internet to print out the current stack for each log. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in debug build, or have the PDB file along the executable file, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code, but here, Google is your friend, but the thing to remember is that you can't have a __try/__except handdler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "I happens only 5% of the time" symptom could be caused by different code path executions, or the fact you have multiple threads playing together with the same data.
The InterlockedDecrement part bothers me: Is your object living in multiple threads? And is m_nRefCount correctly aligned and volatile LONG?
The correctly aligned and LONG part are important, here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on a 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on 32-byte boundaries. Is there #pragma pack() instructions in your code? Does your projet file change the default alignment (I assume you're working on Visual Studio)?
For the volatile part, InterlockedDecrement seem to generate a Read/Write memory barrier, so the volatile part should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).
I have a very simple C++ code here:
char *s = new char[100];
strcpy(s, "HELLO");
delete [] s;
int n = strlen(s);
If I run this code from Visual C++ 2008 by pressing F5 (Start Debugging,) this always result in crash (Access Violation.) However, starting this executable outside the IDE, or using the IDE's Ctrl+F5 (Start without Debugging) doesn't result in any crash. What could be the difference?
I also want to know if it's possible to stably reproduce the Access Violation crash caused from accessing deleted area? Is this kind of crash rare in real-life?
Accessing memory through a deleted pointer is undefined behavior. You can't expect any reliable/repeatable behavior.
Most likely it "works" in the one case because the string is still "sitting there" in the now available memory -= but you cannot rely on that. VS fills memory with debug values to help force crashes to help find these errors.
The difference is that a debugger, and debug libraries, and code built in "debug" mode, likes to break stuff that should break. Your code should break (because it accesses memory it no longer technically owns), so it breaks easier when compiled for debugging and run in the debugger.
In real life, you don't generally get such unsubtle notice. All that stuff that makes things break when they should in the debugger...that stuff's expensive. So it's not checked as strictly in release. You might be able 99 times out of 100 to get away with freeing some memory and accessing it right after, cause the runtime libs don't always hand the memory back to the OS right away. But that 100th time, either the memory's gone, or another thread owns it now and you're getting the length of a string that's no longer a string, but a 252462649-byte array of crap that runs headlong into unallocated (and thus non-existent, as far as you or the runtime should care) memory. And there's next to nothing to tell you what just happened.
So don't do that. Once you've deleted something, consider it dead and gone. Or you'll be wasting half your life tracking down heisenbugs.
Dereferencing a pointer after delete is undefined behavior - anything can happen, including but not limited to:
data corruption
access violation
no visible effects
exact results will depend on multiple factors most of which are out of your control. You'll be much better off not triggering undefined behavior in the first place.
Usually, there is no difference in allocated and freed memory from a process perspective. E.g the process only has one large memory map that grows on demand.
Access violation is caused by reading/writing memory that is not available, ususally not paged in to the process. Various run-time memory debugging utilities uses the paging mechanism to track invalid memory accesses without the severe run time penalty that software memory checking would have.
Anyway your example proves only that an error is sometimes detected when running the program in one environment, but not detected in another environment, but it is still an error and the behaviour of the code above is undefined.
The executable with debug symbols is able to detect some cases of access violations. The code to detect this is contained in the executable, but will not be triggered by default.
Here you'll find an explanation of how you can control behaviour outside of a debugger: http://msdn.microsoft.com/en-us/library/w500y392%28v=VS.80%29.aspx
I also want to know if it's possible
to stably reproduce the Access
Violation crash caused from accessing
deleted area?
Instead of plain delete you could consider using an inline function that also sets the value of the deleted pointer to 0/NULL. This will typically crash if you reference it. However, it won't complain if you delete it a second time.
Is this kind of crash rare in
real-life?
No, this kind of crash is probably behind the majority of the crashes you and I see in software.
I have a very strange bug cropping up right now in a fairly massive C++ application at work (massive in terms of CPU and RAM usage as well as code length - in excess of 100,000 lines). This is running on a dual-core Sun Solaris 10 machine. The program subscribes to stock price feeds and displays them on "pages" configured by the user (a page is a window construct customized by the user - the program allows the user to configure such pages). This program used to work without issue until one of the underlying libraries became multi-threaded. The parts of the program affected by this have been changed accordingly. On to my problem.
Roughly once in every three executions the program will segfault on startup. This is not necessarily a hard rule - sometimes it'll crash three times in a row then work five times in a row. It's the segfault that's interesting (read: painful). It may manifest itself in a number of ways, but most commonly what will happen is function A calls function B and upon entering function B the frame pointer will suddenly be set to 0x000002. Function A:
result_type emit(typename type_trait<T_arg1>::take _A_a1) const
{ return emitter_type::emit(impl_, _A_a1); }
This is a simple signal implementation. impl_ and _A_a1 are well-defined within their frame at the crash. On actual execution of that instruction, we end up at program counter 0x000002.
This doesn't always happen on that function. In fact it happens in quite a few places, but this is one of the simpler cases that doesn't leave that much room for error. Sometimes what will happen is a stack-allocated variable will suddenly be sitting on junk memory (always on 0x000002) for no reason whatsoever. Other times, that same code will run just fine. So, my question is, what can mangle the stack so badly? What can actually change the value of the frame pointer? I've certainly never heard of such a thing. About the only thing I can think of is writing out of bounds on an array, but I've built it with a stack protector which should come up with any instances of that happening. I'm also well within the bounds of my stack here. I also don't see how another thread could overwrite the variable on the stack of the first thread since each thread has it's own stack (this is all pthreads). I've tried building this on a linux machine and while I don't get segfaults there, roughly one out of three times it will freeze up on me.
Stack corruption, 99.9% definitely.
The smells you should be looking carefully for are:-
Use of 'C' arrays
Use of 'C' strcpy-style functions
memcpy
malloc and free
thread-safety of anything using pointers
Uninitialised POD variables.
Pointer Arithmetic
Functions trying to return local variables by reference
I had that exact problem today and was knee-deep in gdb mud and debugging for a straight hour before occurred to me that I simply wrote over array boundaries (where I didn't expect it the least) of a C array.
So, if possible, use vectors instead because any decend STL implementation will give good compiler messages if you try that in debug mode (whereas C arrays punish you with segfaults).
I'm not sure what you're calling a "frame pointer", as you say:
On actual execution of that
instruction, we end up at program
counter 0x000002
Which makes it sound like the return address is being corrupted. The frame pointer is a pointer that points to the location on the stack of the current function call's context. It may well point to the return address (this is an implementation detail), but the frame pointer itself is not the return address.
I don't think there's enough information here to really give you a good answer, but some things that might be culprits are:
incorrect calling convention. If you're calling a function using a calling convention different from how the function was compiled, the stack may become corrupted.
RAM hit. Anything writing through a bad pointer can cause garbage to end up on the stack. I'm not familiar with Solaris, but most thread implementations have the threads in the same process address space, so any thread can access any other thread's stack. One way a thread can get a pointer into another thread's stack is if the address of a local variable is passed to an API that ultimately deals with the pointer on a different thread. unless you synchronize things properly, this will end up with the pointer accessing invalid data. Given that you're dealing with a "simple signal implementation", it seems like it's possible that one thread is sending a signal to another. Maybe one of the parameters in that signal has a pointer to a local?
There's some confusion here between stack overflow and stack corruption.
Stack Overflow is a very specific issue cause by try to use using more stack than the operating system has allocated to your thread. The three normal causes are like this.
void foo()
{
foo(); // endless recursion - whoops!
}
void foo2()
{
char myBuffer[A_VERY_BIG_NUMBER]; // The stack can't hold that much.
}
class bigObj
{
char myBuffer[A_VERY_BIG_NUMBER];
}
void foo2( bigObj big1) // pass by value of a big object - whoops!
{
}
In embedded systems, thread stack size may be measured in bytes and even a simple calling sequence can cause problems. By default on windows, each thread gets 1 Meg of stack, so causing stack overflow is much less of a common problem. Unless you have endless recursion, stack overflows can always be mitigated by increasing the stack size, even though this usually is NOT the best answer.
Stack Corruption simply means writing outside the bounds of the current stack frame, thus potentially corrupting other data - or return addresses on the stack.
At it's simplest:-
void foo()
{
char message[10];
message[10] = '!'; // whoops! beyond end of array
}
That sounds like a stack overflow problem - something is writing beyond the bounds of an array and trampling over the stack frame (and probably the return address too) on the stack. There's a large literature on the subject. "The Shell Programmer's Guide" (2nd Edition) has SPARC examples that may help you.
With C++ unitialized variables and race conditions are likely suspects for intermittent crashes.
Is it possible to run the thing through Valgrind? Perhaps Sun provides a similar tool. Intel VTune (Actually I was thinking of Thread Checker) also has some very nice tools for thread debugging and such.
If your employer can spring for the cost of the more expensive tools, they can really make these sorts of problems a lot easier to solve.
It's not hard to mangle the frame pointer - if you look at the disassembly of a routine you will see that it is pushed at the start of a routine and pulled at the end - so if anything overwrites the stack it can get lost. The stack pointer is where the stack is currently at - and the frame pointer is where it started at (for the current routine).
Firstly I would verify that all of the libraries and related objects have been rebuilt clean and all of the compiler options are consistent - I've had a similar problem before (Solaris 2.5) that was caused by an object file that hadn't been rebuilt.
It sounds exactly like an overwrite - and putting guard blocks around memory isn't going to help if it is simply a bad offset.
After each core dump examine the core file to learn as much as you can about the similarities between the faults. Then try to identify what is getting overwritten. As I remember the frame pointer is the last stack pointer - so anything logically before the frame pointer shouldn't be modified in the current stack frame - so maybe record this and copy it elsewhere and compare upon return.
Is something meaning to assign a value of 2 to a variable but instead is assigning its address to 2?
The other details are lost on me but "2" is the recurring theme in your problem description. ;)
I would second that this definitely sounds like a stack corruption due to out of bound array or buffer writing. Stack protector would be good as long as the writing is sequential, not random.
I second the notion that it is likely stack corruption. I'll add that the switch to a multi-threaded library makes me suspicious that what has happened is a lurking bug has been exposed. Possibly the sequencing the buffer overflow was occurring on unused memory. Now it's hitting another thread's stack. There are many other possible scenarios.
Sorry if that doesn't give much of a hint at how to find it.
I tried Valgrind on it, but unfortunately it doesn't detect stack errors:
"In addition to the performance penalty an important limitation of Valgrind is its inability to detect bounds errors in the use of static or stack allocated data."
I tend to agree that this is a stack overflow problem. The tricky thing is tracking it down. Like I said, there's over 100,000 lines of code to this thing (including custom libraries developed in-house - some of it going as far back as 1992) so if anyone has any good tricks for catching that sort of thing, I'd be grateful. There's arrays being worked on all over the place and the app uses OI for its GUI (if you haven't heard of OI, be grateful) so just looking for a logical fallacy is a mammoth task and my time is short.
Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.
No one asked this, but I built with gcc-4.2. Also, I can guarantee ABI safety here so that's also not the issue. As for the "garbage at the end of the stack" on the RAM hit, the fact that it is universally 2 (though in different places in the code) makes me doubt that as garbage tends to be random.
It is impossible to know, but here are some hints that I can come up with.
In pthreads you must allocate the stack and pass it to the thread. Did you allocate enough? There is no automatic stack growth like in a single threaded process.
If you are sure that you don't corrupt the stack by writing past stack allocated data check for rouge pointers (mostly uninitialized pointers).
One of the threads could overwrite some data that others depend on (check your data synchronisation).
Debugging is usually not very helpful here. I would try to create lots of log output (traces for entry and exit of every function/method call) and then analyze the log.
The fact that the error manifest itself differently on Linux may help. What thread mapping are you using on Solaris? Make sure you map every thread to it's own LWP to ease the debugging.
Also agreed that the 0x000002 is suspect. It is about the only constant between crashes. Even weirder is the fact that this only cropped up with the multi-threaded switch. I think that the smaller stack as a result of the multiple-threads is what's making this crop up now, but that's pure supposition on my part.
If you pass anything on the stack by reference or by address, this would most certainly happen if another thread tried to use it after the first thread returned from a function.
You might be able to repro this by forcing the app onto a single processor. I don't know how you do that with Sparc.