Interpreting cause of segfault with GDB nexti - c++

So, I'm debugging a program which mysteriously crashes via a SIGSEGV. The program is single-threaded.
I've debugged many segfaults before - most of them come down to stack or heap corruption. It's usually easy to debug heap corruption problems with valgrind. Stack corruption is trickier, but you can usually at least tell that stack corruption is the problem when GDB shows that your stack is mangled.
But, here I've encountered a very bizarre problem which I've never seen before. Using GDB to go instruction by instruction, I see that the segfault happens immediately after a callq instruction. Except the callq address is not dynamically loaded from a register or from memory - it's just a static function address:
(gdb) ni
0x00007ffff659c423 223 setPolicyDocumentLoader(docLoader);
1: x/i $pc
=> 0x7ffff659c423 <WebCore::FrameLoader::init()+351>: mov %rdx,%rsi
(gdb)
0x00007ffff659c426 223 setPolicyDocumentLoader(docLoader);
1: x/i $pc
=> 0x7ffff659c426 <WebCore::FrameLoader::init()+354>: mov %rax,%rdi
(gdb)
0x00007ffff659c429 223 setPolicyDocumentLoader(docLoader);
1: x/i $pc
=> 0x7ffff659c429 <WebCore::FrameLoader::init()+357>:
callq 0x7ffff53a2d50 <_ZN7WebCore11FrameLoader23setPolicyDocumentLoaderEPNS_14DocumentLoaderE@plt>
(gdb) ni
Program received signal SIGSEGV, Segmentation fault.
0x0000000000683670 in ?? ()
1: x/i $pc
=> 0x683670: add %al,(%rax)
(gdb)
So, as soon as it executes callq to the address 0x7ffff53a2d50, it suddenly segfaults.
I realize that, in general, Stack Overflow can't be too helpful for most segfaults or problems like this, because the reasons tend to be extremely specific to a particular circumstance, and usually just come down to memory corruption via programmer error.
But I still thought it would be worth asking this question because this fundamentally doesn't even make any sense to me. How is it even possible for the OS to deliver a SIGSEGV when a program executes a callq instruction to a legitimate statically determined function address?

nexti will execute the next instruction, but if the instruction is a call then it executes until the function returns. From the GDB manual:
nexti, nexti arg, ni
Execute one machine instruction, but if it is a function call, proceed until the function returns. An argument is a repeat count, as in next.
When you do the callq, the debugger runs until that function returns, but the program crashes somewhere during execution of that function. If you want to step into a function call, I'd recommend stepi when you hit callq 0x7ffff53a2d50.
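A sketch of that workflow against the session above (the address and symbol are the ones from the question; the exact output will differ on your build):
(gdb) stepi      # step into the call instead of running until it returns
(gdb) bt         # confirm you are now inside setPolicyDocumentLoader
(gdb) ni         # keep stepping (si for nested calls) until the faulting instruction shows up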

as soon as it executes callq to the address 0x7ffff53a2d50, it suddenly segfaults.
This is usually caused by stack overflow.
Look for deep recursion (using the where command). Also look at the stack region (the one containing the current $rsp value) in the output of info proc map.
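For example, a sketch of those checks (mapping names vary between systems):
(gdb) where            # a backtrace thousands of frames deep suggests runaway recursion
(gdb) p/x $rsp         # note the current stack pointer
(gdb) info proc map    # compare $rsp with the bounds of the stack region
If $rsp sits at or below the low end of the stack mapping, the call instruction can fault simply because pushing the return address runs off the end of the stack.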

Related

Why isn't there a call command at each frame of disassemble except the last?

I'm analyzing a core post-mortem via the disassemble output of gdb. I'm new to this, so I'm still growing in my understanding of what I'm looking at. One immediate confusion is that as I go between frames and look at the disassemble output, I don't see a callq instruction as the current instruction in any of the non-zero frames, as I would expect. Shouldn't each frame leading up to frame 0 be calling a function?
(gdb) f 0
(gdb) disassemble
...
=> 0x0000000001b0af10 <+16>: mov (%rdi),%rdx
...
End of assembler dump.
(gdb) info registers rdi
rdi 0x0 0
Makes sense: the crash happened due to a null ptr dereference. Now let's go up a frame and see the disassemble output there:
(gdb) up
(gdb) disassemble
...
=> 0x0000000001b1c01b <+315>: test %al,%al
...
What? The frame above is running test? Shouldn't it be calling the function disassembled in frame 0? What am I misunderstanding?
This is x64 assembly generated from GCC 4.8 compiling C++ code.
What am I misunderstanding?
On x86 (and x86_64), the CALL instruction pushes the address of the next instruction onto stack, and jumps to the called function.
When you go up, the current instruction is the one that will be executed after the frame you just stepped up from returns.
Do x/i $pc-5 if you want to see the actual CALL (note: the -5 works for most, but not all CALLs. See Peter Cordes comment below).
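A sketch of that check, using the session above (the -5 assumes the common 5-byte direct-call encoding, i.e. opcode plus a 32-bit displacement):
(gdb) up
(gdb) x/i $pc        # the instruction that runs after the callee returns (the saved return address)
(gdb) x/i $pc-5      # usually decodes as the call into the function you see in frame 0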

Array index out of bound, but gdb reports the wrong line - why?

I am a C++ beginner. I found a strange phenomenon: GDB cannot give the line number of the root cause of the error in this code.
#include <array>
using std::array;
int main(int argc, char **argv) {
    array<double, 3> edgePoint1{0, 0, 0};
    array<double, 3> edgePoint2{0, 0, 0};
    array<double, 3> edgePoint3{0, 0, 0};
    array<array<double, 3>, 3> edgePoints{};
    edgePoints[0] = edgePoint1;
    edgePoints[1] = edgePoint2;
    edgePoints[3] = edgePoint3;
    return 0;
}
Line 13 is the root of the problem, but when I use 'bt' in GDB, it prints line 15. Why?
Program received signal SIGABRT, Aborted.
0x00007f51f3133d7f in raise () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007f51f3133d7f in raise () from /usr/lib/libc.so.6
#1 0x00007f51f311e672 in abort () from /usr/lib/libc.so.6
#2 0x00007f51f3176878 in __libc_message () from /usr/lib/libc.so.6
#3 0x00007f51f3209415 in __fortify_fail_abort () from /usr/lib/libc.so.6
#4 0x00007f51f32093c6 in __stack_chk_fail () from /usr/lib/libc.so.6
#5 0x0000556e72f282b1 in main (argc=1, argv=0x7ffdc9299218) at /home/wzx/CLionProjects/work/test1.cpp:15
#6 0x0000000000000000 in ?? ()
The debugger diagnoses practical errors: things that happen as a result of the mistakes/bugs in your code, after the very complicated process of translating source code into an actual program that the computer can run. It does not analyse C++ source for the mistakes/bugs in your code, nor is it actually theoretically capable of doing so (at least not in the general case). Here the practical error is that your buffer overrun corrupted the "stack". You only see that symptom reported, not the original cause (the buffer overrun itself).
It's a little like how if you accidentally steer your car off the road and smash it into a tree, the police know that you smashed your car into the tree, but they don't automatically know that this is because you had a stroke at the wheel, or because you were texting, or because you were drunk. They have to investigate to find out these details after the fact, using other (more indirect) pieces of evidence, such as interviewing you or performing a medical examination.
(Notice that the phone flew through the broken window and landed on the ground near the tree: it's nowhere near the driver's hand — even though the cause of the crash was that it was in the driver's hand. A good policeman will realise that the phone probably used to be inside the car, and based on the half-written text message displayed on its screen it was probably in the driver's hand at the time of the crash. Case closed, your honour. Solution: stop texting while driving.)
This is a fact of life with C++, which is why we need to pay careful attention to our code when writing it, so that we don't "shoot ourselves in the foot". Here you were very fortunate to get a crash at all, otherwise you may have missed the bug entirely and instead just seen unexpected/weird behaviours!
Over time, as you gain experience, you will become more accustomed to this, and get skilled at looking "around" or "near" the reported line to see what logical error led to the practical problem. It's mental pattern matching, for the most part. It's also kind of why you can't "learn C++ in 21 days"!
Some tools do exist to make this easier. Static Analysis tools can look at your code and sometimes spot when you've used an impossible array index. Containers (e.g. array and vector) can be implemented with additional bounds checking (for at() this is required; for op[] some implementations add it for convenience in debug mode). Combine tooling with experience for great success!
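For instance, a minimal sketch of the at() behaviour, using the arrays from the question (the exact exception message text is implementation-specific):

#include <array>
#include <iostream>
#include <stdexcept>

int main() {
    std::array<double, 3> edgePoint3{0, 0, 0};
    std::array<std::array<double, 3>, 3> edgePoints{};
    try {
        edgePoints.at(3) = edgePoint3;   // at() bounds-checks: throws instead of corrupting the stack
    } catch (const std::out_of_range& e) {
        std::cerr << "caught: " << e.what() << '\n';
    }
    return 0;
}

With operator[] the same index silently writes past the end of edgePoints; with at() you get a catchable exception (or, if uncaught, a clean terminate with a message) pointing straight at the bad access.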
While what Lightness Races in Orbit said is true, it is also true that, when you compile with debug info (i.e. the -g option with gcc/clang), the compiler emits line information that theoretically allows a debugger to associate each machine instruction with a source line number, even when compiling with -O3, where really fancy optimizations occur.
That said, the explanation of why gdb tells you the program crashed on line 15 is simple: the crash really did not happen on line 13. It's enough to look at the stack backtrace [I compiled your program and ran it under gdb on Linux]:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff7a24801 in __GI_abort () at abort.c:79
#2 0x00007ffff7a6d897 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7ffff7b9a988 "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff7b18cd1 in __GI___fortify_fail_abort (need_backtrace=need_backtrace@entry=false,
msg=msg@entry=0x7ffff7b9a966 "stack smashing detected") at fortify_fail.c:33
#4 0x00007ffff7b18c92 in __stack_chk_fail () at stack_chk_fail.c:29
#5 0x00005555555547e2 in main (argc=1, argv=0x7fffffffdde8) at crash.cpp:15
As you can see in frame #4, your program did not crash because of the buffer overflow itself but because of the compiler's stack protector (the function __stack_chk_fail).
Since that is not code written by you, but code automatically emitted by the compiler precisely in order to detect such bugs, the line information cannot be exact. The compiler simply attributed it to line 15, because that is where your main() function ends and, as you can see in the disassembly, it is where the compiler emitted the stack-sentinel check that detects the corruption.
In order to see the whole picture even better, here is the disassembly code (just use disass /s main in gdb to see it):
13 edgePoints[3] = edgePoint3;
0x000000000000079e <+308>: lea rax,[rbp-0x50]
0x00000000000007a2 <+312>: mov esi,0x3
0x00000000000007a7 <+317>: mov rdi,rax
0x00000000000007aa <+320>: call 0x7e4 <std::array<std::array<double, 3ul>, 3ul>::operator[](unsigned long)>
0x00000000000007af <+325>: mov rcx,rax
0x00000000000007b2 <+328>: mov rax,QWORD PTR [rbp-0x70]
0x00000000000007b6 <+332>: mov rdx,QWORD PTR [rbp-0x68]
0x00000000000007ba <+336>: mov QWORD PTR [rcx],rax
0x00000000000007bd <+339>: mov QWORD PTR [rcx+0x8],rdx
0x00000000000007c1 <+343>: mov rax,QWORD PTR [rbp-0x60]
0x00000000000007c5 <+347>: mov QWORD PTR [rcx+0x10],rax
14 return 0;
0x00000000000007c9 <+351>: mov eax,0x0
15 }
0x00000000000007ce <+356>: mov rdx,QWORD PTR [rbp-0x8]
0x00000000000007d2 <+360>: xor rdx,QWORD PTR fs:0x28
0x00000000000007db <+369>: je 0x7e2 <main(int, char**)+376>
0x00000000000007dd <+371>: call 0x540 <__stack_chk_fail@plt>
0x00000000000007e2 <+376>: leave
As you can see, there are several instructions at line 15, clearly emitted by the compiler because the stack protector is enabled by default.
If you compile your program with -fno-stack-protector, it won't crash [at least it doesn't on my machine with my compiler], but the actual stack corruption will still be there, just producing unpredictable effects. In a bigger program, when stack corruption occurs, you can expect all kinds of weird behavior much later than the moment when the corruption occurred. In other words, the stack protector is a very good thing: it helps you by exposing the problem instead of hiding it, which is what would naturally happen without it.
The issue is in line:
edgePoints[3] = edgePoint3;
Probably a typo.
gdb cannot report the failure earlier, because the offending line is, by itself, a valid instruction apart from the wrong index. The failure surfaces after it has executed, when some code that checks the state of the stack is triggered (which code depends on your compiler flags, which you didn't give).
At that stage, the undefined behavior on the mentioned line has already wreaked havoc and anything can happen. Thanks to the checks, at least you see that you have a stack problem.
You can enable extra checks for out-of-bounds access on some compilers, or use an address sanitizer. Those would have flagged the error without relying on gdb.
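As a sketch of the sanitizer route (flag spellings are the gcc/clang ones, and the exact report format varies by version):

g++ -g -fsanitize=address test1.cpp -o test1
./test1    # AddressSanitizer aborts with a stack-buffer-overflow report naming the source line of the bad write

No gdb session is needed at that point: the report already points at the edgePoints[3] write.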

Analyzing core dump with stack corrupted

I am currently trying to debug a core in my C++ app. The customer has reported a SEGFAULT core with following thread list:
...Other threads go above here
3 Thread 0xf73a2b70 (LWP 2120) 0x006fa430 in __kernel_vsyscall ()
2 Thread 0x2291b70 (LWP 2212) 0x006fa430 in __kernel_vsyscall ()
* 1 Thread 0x218fb70 (LWP 2210) 0x00000000 in ?? ()
The thing that puzzles me is the thread that crashed which points 0x00000000. If I try to inspect backtrace, I get:
Thread 1 (Thread 0x1eeeb70 (LWP 27156)):
#0 0x00000000 in ?? ()
#1 0x00281da7 in SomeClass1::_someKnownMethod1 (this=..., elem=...) at path_to_cpp_file:line_number
#2 0x0028484d in SomeClass2::_someKnownMethod2 (this=..., stream=..., stanza=...) at path_to_cpp_file:line_number
#3 0x002958b2 in SomeClass3::_someKnownMethod3 (this=..., stream=..., elem=...) at path_to_cpp_file:line_number
I apologize for the redaction - a limitation of the NDA.
Obviously, the top frame is quite unknown. My first guess was that the PC register got corrupted by some stack overwrite.
I have tried reproducing the issue in my local deployment by supplying the same call that was seen in Frame #1, but the crash never happened.
I know that cores like this are very difficult to debug, but does anyone have a hint on what to try?
Update
0x00281d8b <+171>: mov edx,DWORD PTR [ebp+0x8]
0x00281d8e <+174>: mov ecx,DWORD PTR [ebp+0xc]
0x00281d91 <+177>: mov eax,DWORD PTR [edx+0x8]
0x00281d94 <+180>: mov edx,DWORD PTR [eax]
0x00281d96 <+182>: mov DWORD PTR [esp+0x8],ecx
0x00281d9a <+186>: mov ecx,DWORD PTR [ebp+0x8]
0x00281d9d <+189>: mov DWORD PTR [esp],eax
0x00281da0 <+192>: mov DWORD PTR [esp+0x4],ecx
0x00281da4 <+196>: call DWORD PTR [edx+0x14]
=> 0x00281da7 <+199>: mov ebx,DWORD PTR [ebp-0xc]
0x00281daa <+202>: mov esi,DWORD PTR [ebp-0x8]
0x00281dad <+205>: mov edi,DWORD PTR [ebp-0x4]
0x00281db0 <+208>: mov esp,ebp
0x00281db2 <+210>: pop ebp
0x00281db3 <+211>: ret
0x00281db4 <+212>: lea esi,[esi+eiz*1+0x0]
... should have been the one from Frame #0, but from the disassembly this makes little sense. It is as if the program crashed while returning from Frame #1, but then why am I seeing the invalid Frame #0? Or does this frame-teardown part belong to the function onPacket?
Update #2:
(gdb) p/x $edx
$5 = 0x1deb664
(gdb) print _listener
$6 = (jax::MyClass &) @0xf6dbf6c4: {_vptr.MyClass = 0x1deb664}
Expanding on Hayt's comment, since the rest of the stack looks fine, I'd suspect that something is going wrong in frame #1; consider the following (obviously incorrect) program, which generates a similar stack trace:
int main() {
    void (*foo)() = 0;
    foo();
    return 0;
}
Stack Trace:
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x000000000040056a in main ()
If frame 1 does not make sense at a source level, you might try looking at disassembly of frame 1. After selecting that frame, disass $pc should show you the disassembly for the entire function, with => to indicate the return address (the instruction immediately after the call to frame 0).
In the case of a null function pointer dereference, the instruction for the call to frame 0 might involve a simple register dereference, in which case you'd want to understand how that register obtained the null value. In some cases including /m in a disass command can be helpful, although it can cause confusion because of the distinction between instruction boundaries and source line boundaries. Omitting /m is more likely to display a meaningful return address.
The => in the updated disassembly (without /m) makes sense. In any frame aside from frame 0, the pc value (what the => points at in the disassembly) indicates the instruction which will execute when the next lowest numbered frame returns (which, due to the crash, did not occur in this case). The pc value in frame 1 is not the value of the pc register at the time of the crash, but rather the saved pc value pushed on the stack by the call instruction. One way to see that is to compare output from x/a $sp in frame 0 to x/i $pc in frame 1.
One way to interpret this disassembly is that edx is some object, and [edx+0x14] points into its vtable. One way the vtable might wind up with a null pointer is a memory allocation issue with a stale reference to a chunk of memory which has been deallocated and subsequently overwritten by its rightful owner (the next piece of code to allocate that chunk). If any of that is applicable here, it can work either way (the code in frame 1 might be the culprit, or it might be the victim). There are other reasons memory might be overwritten with incorrect contents, but double allocation might be a good place to start.
It probably makes sense to examine the contents of the object referenced by edx in frame 1, to see if there are any other anomalies besides what could be an incorrect vtable. Both the print command and the x command (within gdb) can be useful for this. My best guess about which object is referenced by edx, based on disass/m output (at this writing, visible only in the edit history of the question), is _listener, but it would be best to confirm that by further study of the disassembly (the excerpt available here does not seem to include the instruction that determines the value of edx).
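A sketch of that kind of inspection (the names and the 0x1deb664 value are taken from the question's updates; info symbol is the standard gdb way to map an address back to a symbol):
(gdb) frame 1
(gdb) print _listener            # does the object look sane, or like freed/overwritten memory?
(gdb) x/8xw &_listener           # raw words of the object; the first should be the vtable pointer
(gdb) info symbol 0x1deb664      # does the supposed vptr actually fall inside a "vtable for ..." symbol?
If info symbol does not place that address inside a vtable, the object has almost certainly been overwritten, which would fit the stale-reference theory above.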
See also gdb can't access memory address error for a case (described in one of the comments) where a rogue unmap removed the memory backing the stacks of a few other threads, and the resulting core dump was pretty difficult to use.

My program causes segmentation faults in a Windows DLL - how can I debug it?

So I've written this relatively large (~6000 LOC) Qt application, which I use with Windows 7, and I'm getting these strange segmentation faults. Most of the time everything works as I expect it to, but sometimes, I get a segmentation fault in ole32.dll which is part of Windows. The disassembly looks like this:
Function: ole32!CoAddRefServerProcess
0x7601c98a <+0x02e3> or %cl,0x37890446(%ecx)
0x7601c990 <+0x02e9> call 0x760326a6 <ole32!ObjectStublessClient15+2849>
0x7601c995 <+0x02ee> xor %eax,%eax
0x7601c997 <+0x02f0> pop %esi
0x7601c998 <+0x02f1> pop %edi
0x7601c999 <+0x02f2> pop %ebx
0x7601c99a <+0x02f3> pop %ebp
0x7601c99b <+0x02f4> ret $0x4
0x7601c99e <+0x02f7> mov (%eax),%ecx
0x7601c9a0 <+0x02f9> push %eax
0x7601c9a1 <+0x02fa> call *0x8(%ecx)
0x7601c9a4 <+0x02fd> jmp 0x7604a6d9 <ole32!CoRevokeClassObject+16608>
0x7601c9a9 <+0x0302> test $0x2000000,%eax
0x7601c9ae <+0x0307> jne 0x76025dc1 <ole32!StgOpenStorage+5555>
0x7601c9b4 <+0x030d> jmp 0x7604a715 <ole32!CoRevokeClassObject+16668>
The fault always happens in the same dll function at the same place (the first mov instruction). There is no particular time or place in the application when the fault occurs, but it's more likely to strike when the program is collecting or saving data (it seems to be stable when it's doing nothing - I've left it running overnight without crashing).
I'm kind of puzzled by this, since the segmentation fault happens in a piece of code I didn't write, and even though bugs in Windows are not unheard of, I'd guess it's more likely that my program is somehow messing up the memory this DLL uses and hence causing the problem. Can anyone suggest how to find the code that's really causing it?

MIPS core dump with ra and pc equal 0000000

I'm getting intermittent core dumps in one of our processes.
All of the threads' stacks, aside from the one that crashed, seem OK and are parsed correctly.
The thread that crashed has an apparently corrupted call stack.
The stack has two frames, both of them 0x00000000.
Looking at the registers, both PC and RA are 0 (which explains the call stack...).
The cause register is 00800008.
Is there a way I can get more information on the crashed thread?
How come the registers themselves are corrupted? (Or is it the other way around - in a core dump, does the debugger fill in these registers based on the stack?)
Thanks!
To answer (2) first -- because understanding what actually happened is important for finding out more information about the root cause of the crash:
It really is the registers themselves, in the machine at runtime, that are 0; but it's not that the registers themselves got corrupted; rather, memory got corrupted, and that corrupted memory then got copied back into the registers, which finally caused the crash.
What's happening is something like this: the stack becomes corrupted, and in particular (a) the RA, while it is stored in stack memory, gets zeroed out. Then, when the function is ready to return, it (b) restores the RA register from the stack -- so the RA register is now 0 -- and then (c) jump-returns to the RA, thus setting the PC to 0 as well; the next instruction then causes the crash, with both the RA and PC equal to 0.
That business about the RA being stored on the stack and then restored from it is explained, for example, at http://logos.cs.uic.edu/366/notes/mips%20quick%20tutorial.htm (emphasis mine):
return address stored in register $ra; if subroutine will call other subroutines, or is
recursive, return address should be copied from $ra onto stack to preserve it,
since jal always places return address in this register and hence will overwrite
previous value.
Here's an example program which crashes with PC and RA both 0, and which illustrates the above sequence nicely (the exact numbers may have to be tweaked, depending on the system):
#include <string.h>

int bar(void)
{
    char buf[10] = "ABCDEFGHI";
    memset(buf, 0, 50);
    return 0;
}

int foo(void)
{
    return bar();
}

int main(int argc, char *argv[])
{
    return foo();
}
And if we look at the disassembly of foo():
(gdb) disas foo
Dump of assembler code for function foo:
0x00400408 <+0>: addiu sp,sp,-32
0x0040040c <+4>: sw ra,28(sp)
0x00400410 <+8>: sw s8,24(sp)
0x00400414 <+12>: move s8,sp
0x00400418 <+16>: jal 0x4003a0 <bar>
0x0040041c <+20>: nop
0x00400420 <+24>: move sp,s8
0x00400424 <+28>: lw ra,28(sp)
0x00400428 <+32>: lw s8,24(sp)
0x0040042c <+36>: addiu sp,sp,32
0x00400430 <+40>: jr ra
0x00400434 <+44>: nop
End of assembler dump.
we see very nicely that RA gets stored on the stack at the beginning of the function (<+4> sw ra,28(sp)) and then is restored at the end (<+28> lw ra,28(sp)) and then jump-returned to (<+40> jr ra). I showed foo() because it's shorter, but the exact same structure is true for bar() -- except that in bar() there is also the memset() in the middle, which overwrites RA while it is on the stack (it's writing 50 bytes into an array of size 10); and then what gets restored into the register is 0, ultimately causing the crash.
So, now we understand that the root cause of the crash is some kind of stack corruption, which gets us back to question (1): is there any way to get more information about the crashed thread?
Well, this is a bit more difficult, and is where debugging becomes more of an art than a science, but here are the principles to keep in mind:
The basic idea is to figure out what is causing the stack corruption -- most likely, it is a write to some local buffer, as in the example above.
Try to zero in as much as possible on where in the flow the corruption is occurring. Logging can help a lot here: the last log you see obviously happened before the crash (though not necessarily before the corruption!) -- add more logging in the suspect area to zero in on the crash location. Of course, if you have access to a debugger, you can also step through the code to figure out where it's crashing.
Once you find the crash location, it's much easier to work backwards from there: first of all, before the crash, the PC is not yet set to 0, and therefore you should be able to see a backtrace (though, note that the backtrace itself is "calculated" using the values stored on the stack -- once they are corrupted, the backtrace can't be calculated beyond the corruption. But this is actually helpful in this case: this can tell you quite precisely where in memory the corruption is: the point at which the backtrace is truncated is the RA (on the stack) which got corrupted.)
Once you have found what is being corrupted, but you still don't know what is causing the corruption, use watchpoints: as soon as you enter the function which places the RA that is ultimately overwritten on the stack, set a watchpoint on it. That should cause a break as soon as the corruption occurs...
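As a sketch of that last step in gdb, using the foo()/bar() example above (the 28 offset here is hypothetical; read the real "sw ra,NN(sp)" offset from the prologue of the function you are actually investigating):
(gdb) break bar
(gdb) run
(gdb) disas bar                      # find the "sw ra,NN(sp)" in the prologue
(gdb) watch -l *(int *)($sp + 28)    # watch the stack slot holding the saved RA
(gdb) continue                       # gdb stops on the write (here, inside memset) that clobbers it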
Hope this helps!