I dislike pointers, and generally try to write as much code as I can using refs instead.
I've written a very rudimentary "vertical layout" system for a small Win32 app. Most of the Layout methods look like this:
void Control::DoLayout(int availableWidth, int &consumedYAmt)
{
    textYPosition = consumedYAmt;
    consumedYAmt += measureText(font, availableWidth);
}
They are looped through like so:
int innerYValue = 0;
for (const auto &control : controls) {
    control->DoLayout(availableWidth, innerYValue);
}
int heightOfControl = innerYValue;
It's not drawing its content here, just calculating exactly how much space this control will require (usually it's adding padding too, etc). This has worked great for me.......in debug mode.
I found that in Release mode, I could suddenly see tangible, loggable issues where, when I'm looping through controls and calling DoLayout(), the consumedYAmt variable actually stays at 0 in the outside loop. The most annoying part is that if I put in breakpoints and walk through the code line by line, this stops happening and parts of it are properly updated by the inside "add" methods.
I'm kind of thinking about whether this would be some compiler optimization where they think I'm simply adding the ref flag to ints as a way to optimize memory; or if there's any possibility this actually works in a way different from how it seems.
I would give a minimum reproducible example, but I wasn't able to do so with a tiny commandline app. I get the sense that if this is an optimization, it only kicks in for larger code blocks and indirections.
EDIT: Again sorry for generally low information, but I'm now getting hints that this might be some kind of linker issue. I skipped one part of the inheritance model in my pseudocode: The calling class actually calls "Layout()", which is a non-virtual function on the root definition of the class. This function performs some implementation-neutral logic, and then calls DoLayout() with the same arguments. However, I'm now noticing that if I try adding a breakpoint to Layout(), Visual Studio claims that "The breakpoint will not be hit. No executable code of the debugger's target code type is associated with this line." I am able to add breakpoints to certain other lines, but I'm beginning to notice weird stepping logic where it refuses to go inside certain functions, like Layout. Already tried completely clearing the build folders and rebuilding. I'm going to have to keep looking, since I have to admit this isn't a lot to go on.
Also, random addition: The "controls" list is a vector containing shared_ptr objects. I hadn't suspected the looping mechanism previously but now I'm looking more closely.
"the consumedYAmt variable actually stays at 0"
The behavior you describe is typical for a specific optimization that's more due to the CPU than the compiler. I suspect you're logging consumedYAmt from another thread. The updates to consumedYAmt simply don't make it to that other thread.
This is legal for the CPU, because the C++ compiler didn't put in memory fences. And the C++ compiler didn't put in fences because the variable isn't atomic.
In a small program without threads, this simply doesn't show up, nor does it show in debug mode.
Written by OP
Okay, eventually figured this one out. As simple as the issue was, pinning it down became difficult because of Release mode's debugger seemingly acting in inconsistent ways. When I changed tactic to adding Logging statements in lots of places, I found that my Control class had an "mShowing" variable that was uninitialized in its constructor. In debug mode, it apparently retained uninitialized memory which I guess made it "true" - but in release mode, my best analysis is that memory protections made it default to "false", which as it turns out skipped the main body of the DoLayout method most of the time.
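A minimal sketch of the fix described above, assuming a heavily simplified Control class. Only mShowing comes from the post; kTextHeight and LayoutTwoControls are made up for illustration. Giving the flag a default member initializer means it has a defined value no matter which constructor runs:

```cpp
#include <cassert>

// Simplified stand-in for the Control class described above.
// Only mShowing comes from the post; everything else is hypothetical.
class Control {
public:
    Control() {}  // not mentioning mShowing here is now harmless

    void DoLayout(int availableWidth, int &consumedYAmt) {
        if (!mShowing) {
            return;  // hidden controls take no vertical space
        }
        (void)availableWidth;         // unused in this sketch
        textYPosition = consumedYAmt;
        consumedYAmt += kTextHeight;  // stand-in for measureText(...)
    }

    int textYPosition = 0;

private:
    static constexpr int kTextHeight = 20;  // hypothetical fixed height
    bool mShowing = true;  // default member initializer: never indeterminate
};

// Lays out two controls and returns the total height consumed.
int LayoutTwoControls() {
    Control a, b;
    int y = 0;
    a.DoLayout(100, y);
    b.DoLayout(100, y);
    assert(b.textYPosition == 20);  // second control starts below the first
    return y;
}
```

With the initializer in place, debug and release builds should agree on whether the body of DoLayout runs.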
Since through the process, responders were low on information to work with (certainly could've been easier if I posted a longer example), I instead simply upvoted each comment that mentioned uninitialized variables.
Related
I'm currently dealing with one of the strangest bugs I have ever seen. I have this "else if" statement and inside the else-if I have a line of code that is causing a bug to happen elsewhere (my program is kind of complicated so I don't think it would help to post a short snippet of code here because it would be too difficult to explain -- so I apologize in advance if this post seems rather vague).
The issue is that the line of code that is causing the bug is not being called at all. I put a break point at the line and also put a print statement before it but the program never enters that particular "if-else" statement. The bug goes away when I comment out the line and shows up again when I uncomment it. This leads me to believe that the line must be getting called somehow but my break point and prints suggest otherwise.
Has anyone ever heard of something like this happening? Why would a line of code that is not even being called affect the rest of my program? Are there other ways to detect if the line is being called somehow besides using breakpoints and print statements?
I'm using XCode as my IDE and this is a single threaded program (so it's not some weird asynchronous bug)
PROBLEM HAS BEEN SOLVED. SEE TOP ANSWER
It actually may happen in some cases, and I have seen it before. If the bug is a slight buffer overflow, then the presence or absence of that line may make the compiler optimize the memory layout differently (i.e. not allocate some variables, arrange them in a different way, or place segments differently, for example), which by chance no longer triggers the problem.
The same applies if the bug is a strange race condition: the lack of that line may change slightly the timings (due to differently optimized code) and make the bug come out.
Very long shot: that code may even somehow trigger a compiler bug. This is much less likely, but it is possible.
So: yes it's definitely possible and I already saw it. And if something like this is happening to you and you're 100% sure your code is correct then be very careful since something quite nasty may be hiding in the code.
Not that there is enough information here to really answer this question, but perhaps I can try to provide some pointers.
Optimization. Try turning it off if it is turned on at all. When the compiler is optimizing your code, it may make different decisions when a statement is present vs. when it isn't. Whenever I am dealing with bugs that go away when a seemingly unrelated code construct is changed, it is usually that the optimizer is doing something differently. I suppose a very simple example would be if the statement accesses something that the compiler would otherwise think is never accessed and can be folded away. The "unrelated" line may take an address of something which affects aliasing information, etc. All this isn't to say that the optimizer is wrong; your code likely still does have a bug, but it explains the weird behaviour.
Debugging. It is very unlikely that the bug is in this line that is not reached. What I would focus on is setting watch points for the variables that are getting incorrect values (assuming you can narrow it down to what is receiving the wrong value).
These very weird issues often indicate an uninitialized variable or something along those lines. If the underlying compiler supports it, you can try providing an option that will cause the program to trap when an uninitialized memory location is being accessed.
Finally, this could be an indication that something is overwriting areas of the stack (and perhaps less likely other memory areas - heap, data). If you have anything that is writing to arrays allocated on the stack, check if you're walking past the end of the array.
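As a rough illustration of that last check, here is a sketch that swaps a raw stack array for std::array and uses at(), which throws instead of silently writing past the end. FillChecked and the buffer size of 8 are hypothetical names and values chosen for the example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Writes `value` into `count` consecutive slots of a stack-allocated buffer.
// Returns true if every write was in range, false if a write would have
// walked past the end (at() throws instead of corrupting the stack).
bool FillChecked(std::size_t count, int value) {
    std::array<int, 8> buffer{};  // lives on the stack
    try {
        for (std::size_t i = 0; i < count; ++i) {
            buffer.at(i) = value;  // bounds-checked, unlike buffer[i]
        }
    } catch (const std::out_of_range &) {
        return false;
    }
    return true;
}
```

Temporarily switching suspicious indexing to a checked form like this can turn a silent overwrite into an immediate, debuggable failure.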
Make sure you're using { and } around the bodies in your if() and else clauses. Without them it's easy to wind up with something like this:
if (a)
    do_a_work();
    if (b)
        do_a_and_b_work();
else
    do_not_a_work();
Which actually equates to this (and is not what's implied by the indenting):
if (a) {
    do_a_work();
}
if (b) {
    do_a_and_b_work();
} else {
    do_not_a_work();
}
because the brace-less if (a) only takes the first statement below it for its body.
Commenting out your mystery line may be changing which code belongs to which if-else.
I get a segmentation fault when attempting to delete this.
I know what you think about delete this, but it has been left over by my predecessor. I am aware of some precautions I should take, which have been validated and taken care of.
I don't get what kind of conditions might lead to this crash, only once in a while. About 95% of the time the code runs perfectly fine but sometimes this seems to be corrupted somehow and crash.
The destructor of the class doesn't do anything btw.
Should I assume that something is corrupting my heap somewhere else and that the this pointer is messed up somehow?
Edit : As requested, the crashing code:
long CImageBuffer::Release()
{
    long nRefCount = InterlockedDecrement(&m_nRefCount);
    if(nRefCount == 0)
    {
        delete this;
    }
    return nRefCount;
}
The object has been created with a new, it is not in any kind of array.
The most obvious answer is : don't delete this.
If you insist on doing that, then use the common ways of finding bugs:
1. use valgrind (or similar tool) to find memory access problems
2. write unit tests
3. use debugger (prepare for loooong staring at the screen - depends on how big your project is)
It seems like you've mismatched new and delete. Note that delete this; can only be used on an object which was allocated using new (and in case of overridden operator new, or multiple copies of the C++ runtime, the particular new that matches delete found in the current scope)
Crashes upon deallocation can be a pain: It is not supposed to happen, and when it happens, the code is too complicated to easily find a solution.
Note: The use of InterlockedDecrement makes me assume you are working on Windows.
Log everything
My own solution was to massively log the construction/destruction, as the crash could well never happen while debugging:
Log the construction, including the this pointer value, and other relevant data
Log the destruction, including the this pointer value, and other relevant data
This way, you'll be able to see if the this was deallocated twice, or even allocated at all.
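A minimal sketch of this kind of logging, assuming a single-threaded program. LifetimeLog and Tracked are hypothetical names, and a real multi-threaded program would need a mutex around the set:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <set>

// Hypothetical registry of live objects, keyed by `this` pointer value.
// Not thread-safe: with several threads, guard the set with a mutex.
class LifetimeLog {
public:
    static void OnConstruct(const void *p) {
        std::printf("ctor  %p\n", p);
        Live().insert(p);
    }
    // Returns false if p was not registered (never constructed, or
    // already destroyed), which is the situation to investigate.
    static bool OnDestroy(const void *p) {
        const bool known = Live().erase(p) == 1;
        std::printf("dtor  %p %s\n", p, known ? "OK" : "UNKNOWN/DOUBLE");
        return known;
    }
    static std::size_t LiveCount() { return Live().size(); }

private:
    static std::set<const void *> &Live() {
        static std::set<const void *> live;
        return live;
    }
};

// Stand-in for a class such as CImageBuffer.
struct Tracked {
    Tracked() { LifetimeLog::OnConstruct(this); }
    ~Tracked() { LifetimeLog::OnDestroy(this); }
};
```

A destruction logged as UNKNOWN/DOUBLE points at an object that was deleted twice or never constructed through the expected path.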
... everything, including the stack
My problem happened in Managed C++/.NET code, meaning that I had easy access to the stack, which was a blessing. You seem to work on plain C++, so retrieving the stack could be a chore, but still, it remains very very useful.
You should try to load code from internet to print out the current stack for each log. I remember playing with http://www.codeproject.com/KB/threads/StackWalker.aspx for that.
Note that you'll need to either be in debug build, or have the PDB file along the executable file, to make sure the stack will be fully printed.
... everything, including multiple crashes
I believe you are on Windows: You could try to catch the SEH exception. This way, if multiple crashes are happening, you'll see them all, instead of seeing only the first, and each time you'll be able to mark "OK" or "CRASHED" in your logs. I went even as far as using maps to remember addresses of allocations/deallocations, thus organizing the logs to show them together (instead of sequentially).
I'm at home, so I can't provide you with the exact code; Google is your friend here. The thing to remember is that you can't have a __try/__except handler everywhere (C++ unwinding and C++ exception handlers are not compatible with SEH), so you'll have to write an intermediary function to catch the SEH exception.
Is your crash thread-related?
Last, but not least, the "It happens only 5% of the time" symptom could be caused by different code path executions, or by the fact that you have multiple threads playing together with the same data.
The InterlockedDecrement part bothers me: Is your object living in multiple threads? And is m_nRefCount correctly aligned and volatile LONG?
The correctly aligned and LONG part are important, here.
If your variable is not a LONG (for example, it could be a size_t, which is not a LONG on a 64-bit Windows), then the function could well work the wrong way.
The same can be said for a variable not aligned on a 32-bit boundary. Are there #pragma pack() directives in your code? Does your project file change the default alignment (I assume you're working in Visual Studio)?
For the volatile part, InterlockedDecrement seems to generate a Read/Write memory barrier, so the volatile part should not be mandatory (see http://msdn.microsoft.com/en-us/library/f20w0x5e.aspx).
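For comparison, here is a portable sketch of the same Release() pattern with std::atomic<long> swapped in for a LONG manipulated by InterlockedDecrement. RefCounted and its destroyedFlag hook are hypothetical and exist only so destruction can be observed:

```cpp
#include <atomic>
#include <cassert>

// Portable sketch of an intrusive reference count. std::atomic<long> is
// swapped in for a LONG used with InterlockedIncrement/InterlockedDecrement.
class RefCounted {
public:
    explicit RefCounted(bool *destroyedFlag) : destroyed_(destroyedFlag) {}

    long AddRef() { return ++count_; }

    long Release() {
        // Copy the result to a local first: after `delete this`, no member
        // (including count_) may be touched again.
        const long remaining = --count_;
        if (remaining == 0) {
            delete this;
        }
        return remaining;
    }

private:
    ~RefCounted() { *destroyed_ = true; }  // private: forces heap allocation

    std::atomic<long> count_{1};  // the creator holds the first reference
    bool *destroyed_;             // hypothetical hook to observe destruction
};
```

The atomic type takes care of size and alignment, so the questions above about LONG and packing do not arise in this variant. Unbalanced AddRef/Release calls elsewhere would still crash it in the same way.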
I have a C++ application cross-compiled for Linux running on an ARM CortexA9 processor which is crashing with a SIGFPE/Arithmetic exception. Initially I thought that it's because of some optimizations introduced by the -O3 flag of gcc but then I built it in debug mode and it still crashes.
I debugged the application with gdb which catches the exception but unfortunately the operation triggering exception seems to also trash the stack so I cannot get any detailed information about the place in my code which causes that to happen. The only detail I could finally get was the operation triggering the exception(from the following piece of stack trace):
3 raise() 0x402720ac
2 __aeabi_uldivmod() 0x400bb0b8
1 __divsi3() 0x400b9880
The __aeabi_uldivmod() function performs an unsigned long long division and remainder, so I tried the brute-force approach and searched my code for places that might use that operation, but without much success, as it proved to be a daunting task. I also tried to check for potential divisions by zero, but again the code base is pretty large, and checking every division operation is a cumbersome and somewhat dumb approach. So there must be a smarter way to figure out what's happening.
Are there any techniques to track down the causes of such exceptions when the debugger cannot do much to help?
UPDATE: After crunching on hex numbers, dumping memory and doing stack forensics(thanks Crashworks) I came across this gem in the ARM Compiler documentation(even though I'm not using the ARM Ltd. compiler):
Integer division-by-zero errors can be trapped and identified by
re-implementing the appropriate C library helper functions. The
default behavior when division by zero occurs is that when the signal
function is used, or
__rt_raise() or __aeabi_idiv0() are re-implemented, __aeabi_idiv0() is
called. Otherwise, the division function returns zero.
__aeabi_idiv0() raises SIGFPE with an additional argument, DIVBYZERO.
So I put a breakpoint at __aeabi_idiv0 (and __aeabi_ldiv0) et voilà! I had my complete stack trace before it was completely trashed. Thanks everybody for their very informative answers!
Disclaimer: the "winning" answer was chosen solely and subjectively taking into account the weight of its suggestions into my debugging efforts, because more than one was informative and really helpful.
My first suggestion would be to open a memory window looking at the region around your stack pointer, and go digging through it to see if you can find uncorrupted stack frames nearby that might give you a clue as to where the crash was. Usually stack-trashes only burn a couple of the stack frames, so if you look upwards a few hundred bytes, you can get past the damaged area and get a general sense of where the code was. You can even look down the stack, on the assumption that the dead function might have called some other function before it died, and thus there might be an old frame still in memory pointing back at the current IP.
In the comments, I linked some presentation slides that illustrate the technique on a PowerPC — look at around #73-86 for a case study in a similar botched-stack crash. Obviously your ARM's stack frames will be laid out differently, but the general principle holds.
(Using the basic idea from Fedor Skrynnikov, but with compiler help instead)
Compile your code with -pg. This will insert calls to mcount and mcountleave() in every function. Do not link against the GCC profiling lib, but provide your own. The only thing you want to do in your mcount and mcountleave() is to keep a copy of the current stack, so just copy the top 128 bytes or so of the stack to a fixed buffer. Both the stack and the buffer will be in cache all the time so it's fairly cheap.
You can implement special guards in functions that can cause the exception. A guard is a simple class: in its constructor you record the file name and line (__FILE__, __LINE__) in a file, an array, or whatever. The main condition is that this storage is shared by all instances of the class (a kind of stack). In the destructor you remove that entry. To make it work, create the guard on the first line of each function, and only on the stack. When execution leaves the current block, the destructor is called. So at the moment of your exception, this improvised call stack tells you which function is causing the problem.
Of course, you may put the creation of this class under a debug condition.
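A rough sketch of such a guard, with made-up names (CallSite, CallTrail, ScopeGuard, TRACE_SCOPE). A std::vector is used for brevity; code meant to be inspected after a crash would more likely write into a fixed-size array:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry of the improvised call stack.
struct CallSite {
    const char *file;
    int line;
};

// Shared storage for all guards. A plain vector is an assumption made for
// brevity; a crash handler would more likely dump a fixed-size array.
std::vector<CallSite> &CallTrail() {
    static std::vector<CallSite> trail;
    return trail;
}

// RAII guard: pushes its location on construction, pops on destruction.
class ScopeGuard {
public:
    ScopeGuard(const char *file, int line) { CallTrail().push_back({file, line}); }
    ~ScopeGuard() { CallTrail().pop_back(); }
    ScopeGuard(const ScopeGuard &) = delete;
    ScopeGuard &operator=(const ScopeGuard &) = delete;
};

// Place on the first line of each function of interest.
#define TRACE_SCOPE() ScopeGuard traceScopeGuard(__FILE__, __LINE__)

std::size_t Inner() {
    TRACE_SCOPE();
    return CallTrail().size();  // depth seen from the innermost function
}

std::size_t Outer() {
    TRACE_SCOPE();
    return Inner();
}
```

To compile the guards out of release builds, TRACE_SCOPE() could be defined as empty under the appropriate condition.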
Enable generation of core files, and open the core file with the debugger.
Since it uses raise() to raise the exception, I would expect that signal() should be able to catch it. Is this not the case?
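Alternatively, you can set a conditional breakpoint at __aeabi_uldivmod to break when the divisor is 0 (under the ARM calling convention the 64-bit divisor is the second argument, passed in r2 and r3; r0 and r1 hold the numerator).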
This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
Common reasons for bugs in release version not present in debug mode
Sometimes I encounter strange situations where the program runs incorrectly when run normally, and pops up the termination dialog, but runs correctly under the debugger. This frustrates me when I want to use the debugger to find the bug in my code.
Have you ever met this kind of situation and why?
Update:
To prove there are logical reasons that can lead to such a frustrating situation:
I think one big possibility is a heap access violation. I once wrote a function that allocated a small buffer, but later stepped past its boundary. It ran correctly within gdb, cdb, etc. (I do not know why, but it did), yet terminated abnormally when run normally.
I am using C++.
I do not think my problem duplicate the above one.
That one is a comparison between release mode and debug mode, but mine is between debugging and not debugging, for which there is a word, heisenbug, as many others noted.
thanks.
You have a heisenbug.
Debugger might be initializing values
Some environments initialize variables and/or memory to known values like zero in debug builds but not release builds.
Release might be built with optimizations
Modern compilers are good, but it could hypothetically happen that optimized code functions differently than non-optimized code. Edit: These days, compiler bugs are rare. If you find yourself thinking you have one, exhaust all other ideas first.
There can be other reasons for heisenbugs.
Here's a common gotcha that can lead to a Heisenbug (love that name!):
// Sanity check - this should never fail
ASSERT( ReleaseResources() == SUCCESS);
In a debug build, this will work as expected, but the ASSERT macro's argument is ignored in a release build. By ignored, I mean that not only won't the result be reported, but the expression won't be evaluated at all (i.e. ReleaseResources() won't be called).
This is a common mistake, and it's why the Windows SDK defines a VERIFY() macro in addition to the ASSERT() macro. They both generate an assertion dialog at runtime in a debug build if the argument evaluates to false. Their behavior is different for a release build, however. Here's the difference:
ASSERT( foo() == true ); // Confirm that call to foo() was successful
VERIFY( bar() == true ); // Confirm that call to bar() was successful
In a debug build, the above two macros behave identically. In a release build, however, they are essentially equivalent to:
; // Confirm that call to foo() was successful
bar(); // Confirm that call to bar() was successful
By the way, if your environment defines an ASSERT() macro, but not a VERIFY() macro, you can easily define your own:
#ifdef _DEBUG
// DEBUG build: Define VERIFY simply as ASSERT
# define VERIFY(expr) ASSERT(expr)
#else
// RELEASE build: Define VERIFY as the expression, without any checking
# define VERIFY(expr) ((void)(expr))
#endif
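To make the difference concrete, here is a small sketch that writes out the two release-build definitions explicitly, under made-up names (RELEASE_ASSERT, RELEASE_VERIFY) so they do not clash with any real macros. The ReleaseResources() here is a hypothetical stand-in with a visible side effect:

```cpp
#include <cassert>

// The release-build definitions written out explicitly, under made-up names
// so they do not clash with any real ASSERT/VERIFY macros.
#define RELEASE_ASSERT(expr) ((void)0)       // expression is discarded
#define RELEASE_VERIFY(expr) ((void)(expr))  // expression still runs

static int g_calls = 0;

// Hypothetical function with a side effect, standing in for ReleaseResources().
bool ReleaseResources() {
    ++g_calls;
    return true;
}

// Returns how many times ReleaseResources() actually ran.
int CountCalls() {
    g_calls = 0;
    RELEASE_ASSERT(ReleaseResources() == true);  // never called
    RELEASE_VERIFY(ReleaseResources() == true);  // called once
    return g_calls;
}
```

Only the VERIFY-style macro keeps the side effect, which is why code that must run in every build should not live inside an ASSERT.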
Hope that helps.
Apparently stackoverflow won't let me post a response which contains only a single word :)
VALGRIND
When using a debugger, sometimes memory gets initialized (e.g. zero'ed) whereas without a debugging session, memory can be random. This could explain the behavior you are seeing.
You have dialogs, so there may be threads in your application. If there are threads, there is a possibility of race conditions.
Let's say your main thread initializes a structure that another thread uses. When you run your program inside the debugger, the initializing thread may be scheduled before the other thread, while in your real-life situation the thread that uses the structure is scheduled before the other thread actually initializes it.
In addition to what JeffH said, you have to consider if the deploying computer (or server) has the same environment/libraries/whatever_related_to_the_program.
Sometimes it's very difficult to debug correctly if you debug with other conditions.
Giovanni
Also, debuggers might add some padding around allocated memory changing the behaviour. This has caught me out a number of times, so you need to be aware of it. Getting the same memory behaviour in debug is important.
For MSVC, this can be disabled with the env-var _NO_DEBUG_HEAP=1. (The debug heap is slow, so this helps if your debug runs are hideously slow too..).
Another way to get the same effect is to start the process outside the debugger, so you get a normal startup, then wait on the first line in main and attach the debugger to the process. That should work for "any" system, provided that you don't crash before main. (You could wait in the constructor of a statically constructed, pre-main object then...)
But I've no experience with gcc/gdb in this matter, but things might be similar there... (Comments welcome.)
One real-world example of a heisenbug, from Raymond Zhang.
/*--------------------------------------------------------------
GdPage.cpp : a real example to illustrate Heisenberg Effect
related with guard page by Raymond Zhang, Oct. 2008
--------------------------------------------------------------*/
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    LPVOID lpvAddr; // address of the test memory
    lpvAddr = VirtualAlloc(NULL, 0x4096,
                           MEM_RESERVE | MEM_COMMIT,
                           PAGE_READONLY | PAGE_GUARD);
    if(lpvAddr == NULL)
    {
        // GetLastError() returns a DWORD (unsigned long), hence %lu
        printf("VirtualAlloc failed with %lu\n", GetLastError());
        return -1;
    }
    return *(long *)lpvAddr;
}
The program terminates abnormally whether it is compiled as Debug or Release, because specifying the PAGE_GUARD flag causes the following:
Pages in the region become guard
pages. Any attempt to read from or
write to a guard page causes the
system to raise a STATUS_GUARD_PAGE
exception and turn off the guard page
status. Guard pages thus act as a
one-shot access alarm.
So you'd get STATUS_GUARD_PAGE when trying to access *lpvAddr. But if you load the program in a debugger and watch *lpvAddr, or step through the last statement, return *(long *)lpvAddr, instruction by instruction, the debugger reads the guard page itself to determine the value of *lpvAddr. The debugger has therefore cleared the guard alarm for us before we access *lpvAddr.
Which programming language are you using? Certain languages, such as C++, behave slightly differently between release and debug builds. In the case of C++, this means that when you declare a variable, such as int i;, a debug build may fill it with a known value (0 on some toolchains, a pattern such as 0xCC with MSVC's runtime checks), while in release builds it may take any value (whatever was stored in its memory location before).
One big reason is that debug code may define the _DEBUG macro that one may use in the code to add extra stuff in debug builds.
For multithreaded code, optimization may affect ordering which may influence race conditions.
I do not know if debug code adds code on the stack to mark stack frames. Any extra stuff on the stack may hide the effects of buffer overruns.
Try using the same command options as your release build and just add the -g (or equivalent debug flag). gcc allows the debug option together with the optimization options.
If your logic depends on data from the system clock, you could see serious probe effects. If you break into the debugger, you will obviously affect the values returned from clock functions such as timeGetTime(). The same is true if your program takes longer to execute. As other people have said, debug builds insert NOOPs. Also, simply running under the debugger (without hitting breakpoints) might slow things down.
An example of where this might happen is a real-time physics simulation with a variable time step, based off elapsed system time. This is why there are articles like this:
http://gafferongames.com/game-physics/fix-your-timestep/
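A minimal sketch of an accumulator-style fixed timestep, which is one common way to decouple simulation results from how long each frame happened to take. RunFixedSteps is a hypothetical name, and frame durations are passed in as plain integers (milliseconds) rather than read from the clock, so the behaviour is the same with or without a debugger attached:

```cpp
#include <cassert>
#include <vector>

// Accumulator-style fixed timestep. Frame durations are passed in explicitly
// (in milliseconds) so the loop does not depend on the real clock.
// Returns the number of fixed-size simulation steps executed.
int RunFixedSteps(const std::vector<int> &frameTimesMs, int stepMs) {
    if (stepMs <= 0) {
        return 0;  // guard against an endless inner loop
    }
    int accumulator = 0;
    int steps = 0;
    for (int frameMs : frameTimesMs) {
        accumulator += frameMs;          // time the simulation still owes
        while (accumulator >= stepMs) {  // consume it in fixed slices
            ++steps;                     // a real loop would call Step(stepMs) here
            accumulator -= stepMs;
        }
    }
    return steps;
}
```

A long pause, such as sitting on a breakpoint, then shows up as more steps of the same size rather than one huge step with different numerical behaviour.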
I'm working on a game and I'm currently working on the part that handles input. Three classes are involved here, there's the ProjectInstance class which starts the level and stuff, there's a GameController which will handle the input, and a PlayerEntity which will be influenced by the controls as determined by the GameController. Upon starting the level the ProjectInstance creates the GameController, and it will call its EvaluateControls method in the Step method, which is called inside the game loop. The EvaluateControls method looks a bit like this:
void CGameController::EvaluateControls(CInputBindings *pib) {
    // if no player yet
    if (gc_ppePlayer == NULL) {
        // create it
        Handle<CPlayerEntityProperties> hep = memNew(CPlayerEntityProperties);
        gc_ppePlayer = (CPlayerEntity *)hep->SpawnEntity();
        memDelete((CPlayerEntityProperties *)hep);
        ASSERT(gc_ppePlayer != NULL);
        return;
    }
    // handles controls here
}
This function is called correctly and the assert never triggers. However, every time this function is called, gc_ppePlayer is set to NULL. As you can see it's not a local variable going out of scope. The only place gc_ppePlayer can be set to NULL is in the constructor or possibly in the destructor, neither of which are being called in between the calls to EvaluateControls. When debugging, gc_ppePlayer receives a correct and expected value before the return. When I press F10 one more time and the cursor is at the closing brace, the value changes to 0xffffffff. I'm at a loss here, how can this happen? Anyone?
Set a watchpoint on gc_ppePlayer == NULL. When the value of that expression changes (to NULL or from NULL), the debugger will point you to exactly where it happened.
Try that and see what happens. Look for unterminated strings, or memcpy copying into memory that is too small, etc. That is usually the cause of global/stack variables being overwritten randomly.
To add a watchpoint in VS2005 (instructions by brone):
1. Go to the Breakpoints window.
2. Click New.
3. Click Data breakpoint. Enter &gc_ppePlayer in the Address box and leave the other values alone.
4. Run. When gc_ppePlayer changes, the breakpoint will be hit.
Are you debugging a Release or Debug configuration? In release build configuration, what you see in the debugger isn't always true. Optimisations are made, and this can make the watch window show quirky values like you are seeing.
Are you actually seeing the ASSERT triggering? ASSERTs are normally compiled out of Release builds, so I'm guessing you are debugging a release build which is why the ASSERT isn't causing the application to terminate.
I would recommend building a Debug version of the software, and then seeing if gc_ppePlayer is really NULL. If it really is, maybe you are seeing memory heap corruption of some sort where this pointer is being overwritten. But if it was memory corruption, it would generally be much less deterministic than you are describing.
As an aside, using global pointer values like this is generally considered bad practice. See if you can replace this with a singleton class if it is truly a single object and needs to be globally accessible.
My first thought is to say that SpawnEntity() is returning a pointer to an internal member that is getting "cleared" when memDelete() is called. It's not clear to me when the pointer is set to 0xffffffff, but if it occurs during the call to memDelete(), then this explains why your ASSERT is not firing - 0xffffffff is not the same as NULL.
How long has it been since you've rebuilt the entire code base? I've seen memory problems like this every now and again that are cleared up by simply rebuilding the entire solution.
Have you tried doing a step into (F11) instead of the step over (F10) at the end of the function? Although your example doesn't show any local variables, perhaps you left some out for the sake of simplicity. If so, F11 will (hopefully) step into the destructors for any of those variables, allowing you to see if one of them is causing the problem.
You have a "fandango on core."
The dynamic initialization is overwriting assorted bits (sic) of memory.
Either directly, or indirectly, the global is being overwritten.
where is the global in memory relative to the heap?
binary chop the dynamically initialized portion until the problem goes away.
(comment out half at a time, recursively)
Depending on what platform you are on there are tools (free or paid) that can quickly figure out this sort of memory issue.
Off the top of my head:
Valgrind
Rational Purify