Analyzing core dump with stack corrupted - c++

I am currently trying to debug a core dump from my C++ app. The customer has reported a SEGFAULT core with the following thread list:
...Other threads go above here
3 Thread 0xf73a2b70 (LWP 2120) 0x006fa430 in __kernel_vsyscall ()
2 Thread 0x2291b70 (LWP 2212) 0x006fa430 in __kernel_vsyscall ()
* 1 Thread 0x218fb70 (LWP 2210) 0x00000000 in ?? ()
The thing that puzzles me is the thread that crashed, which points to 0x00000000. If I try to inspect the backtrace, I get:
Thread 1 (Thread 0x1eeeb70 (LWP 27156)):
#0 0x00000000 in ?? ()
#1 0x00281da7 in SomeClass1::_someKnownMethod1 (this=..., elem=...) at path_to_cpp_file:line_number
#2 0x0028484d in SomeClass2::_someKnownMethod2 (this=..., stream=..., stanza=...) at path_to_cpp_file:line_number
#3 0x002958b2 in SomeClass3::_someKnownMethod3 (this=..., stream=..., elem=...) at path_to_cpp_file:line_number
I apologize for the redaction; it is a limitation of the NDA.
Obviously, the top frame is unknown. My first guess was that the PC register got corrupted by some stack overwrite.
I have tried reproducing the issue in my local deployment by supplying the same call that was seen in Frame #1, but the crash never happened.
I know that cores like this are very difficult to debug, but does anyone have a hint on what to try?
Update
0x00281d8b <+171>: mov edx,DWORD PTR [ebp+0x8]
0x00281d8e <+174>: mov ecx,DWORD PTR [ebp+0xc]
0x00281d91 <+177>: mov eax,DWORD PTR [edx+0x8]
0x00281d94 <+180>: mov edx,DWORD PTR [eax]
0x00281d96 <+182>: mov DWORD PTR [esp+0x8],ecx
0x00281d9a <+186>: mov ecx,DWORD PTR [ebp+0x8]
0x00281d9d <+189>: mov DWORD PTR [esp],eax
0x00281da0 <+192>: mov DWORD PTR [esp+0x4],ecx
0x00281da4 <+196>: call DWORD PTR [edx+0x14]
=> 0x00281da7 <+199>: mov ebx,DWORD PTR [ebp-0xc]
0x00281daa <+202>: mov esi,DWORD PTR [ebp-0x8]
0x00281dad <+205>: mov edi,DWORD PTR [ebp-0x4]
0x00281db0 <+208>: mov esp,ebp
0x00281db2 <+210>: pop ebp
0x00281db3 <+211>: ret
0x00281db4 <+212>: lea esi,[esi+eiz*1+0x0]
... should have been the one from Frame #0, but from the disassembly this makes little sense. It looks as if the program crashed while returning from Frame #1, but then why am I seeing the invalid Frame #0? Or does this frame teardown part belong to the function onPacket?
Update #2:
(gdb) p/x $edx
$5 = 0x1deb664
(gdb) print _listener
$6 = (jax::MyClass &) @0xf6dbf6c4: {_vptr.MyClass = 0x1deb664}

Expanding on Hayt's comment, since the rest of the stack looks fine, I'd suspect that something is going wrong in frame #1; consider the following (obviously incorrect) program, which generates a similar stack trace:
int main() {
void (*foo)() = 0;
foo();
return 0;
}
Stack Trace:
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x000000000040056a in main ()

If frame 1 does not make sense at a source level, you might try looking at disassembly of frame 1. After selecting that frame, disass $pc should show you the disassembly for the entire function, with => to indicate the return address (the instruction immediately after the call to frame 0).
In the case of a null function pointer dereference, the instruction for the call to frame 0 might involve a simple register dereference, in which case you'd want to understand how that register obtained the null value. In some cases including /m in a disass command can be helpful, although it can cause confusion because of the distinction between instruction boundaries and source line boundaries. Omitting /m is more likely to display a meaningful return address.
The => in the updated disassembly (without /m) makes sense. In any frame aside from frame 0, the pc value (what the => points at in the disassembly) indicates the instruction which will execute when the next lowest numbered frame returns (which, due to the crash, did not occur in this case). The pc value in frame 1 is not the value of the pc register at the time of the crash, but rather the saved pc value pushed on the stack by the call instruction. One way to see that is to compare output from x/a $sp in frame 0 to x/i $pc in frame 1.
One way to interpret this disassembly is that eax points to some object (it is loaded from [edx+0x8], where edx first holds the value at [ebp+0x8], presumably this), edx is then reloaded from [eax] and so holds that object's vtable pointer, and [edx+0x14] is a slot in the vtable. One way the call could wind up reading a null pointer from that slot is a memory allocation issue with a stale reference to a chunk of memory which has been deallocated and subsequently overwritten by its rightful owner (the next piece of code to allocate that chunk). If any of that is applicable here, it can work either way (the code in frame 1 might be the culprit, or it might be the victim). There are other reasons memory might be overwritten with incorrect contents, but double allocation might be a good place to start.
It probably makes sense to examine the contents of the object referenced by eax in frame 1, to see if there are any other anomalies besides what could be an incorrect vtable pointer. Both the print command and the x command (within gdb) can be useful for this. My best guess about which object this is, based on disass/m output (at this writing, visible only in the edit history of the question), is _listener. That is consistent with Update #2 in the question, where $edx equals the _vptr value of _listener, but it would be best to confirm it by further study of the disassembly.
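To make the vtable reasoning concrete, here is a minimal sketch of what "load the vtable pointer from the object, then call through a slot" amounts to. It is a hand-rolled model, not how the compiler or the question's code actually lays things out: Object, Slot, get_value, checked_dispatch and both tables are hypothetical names, and the slot index 5 simply assumes 4-byte pointers, so that byte offset 0x14 is the sixth entry.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of an object whose first word is a vtable pointer.
struct Object;
using Slot = int (*)(Object*);

struct Object {
    const Slot* vptr;  // plays the role of _vptr.MyClass
    int value;
};

int get_value(Object* self) { return self->value; }

// Assumption: 4-byte pointers, so byte offset 0x14 is slot index 5.
constexpr std::size_t kSlotIndex = 0x14 / 4;

// A well-formed table: every slot points at a real function.
const std::array<Slot, 6> kGoodTable = {&get_value, &get_value, &get_value,
                                        &get_value, &get_value, &get_value};

// What a stale object might end up pointing at once its memory has been
// reused: a region that happens to contain zeros.
const std::array<Slot, 6> kZeroedTable = {};

// Mirrors "mov edx,[eax]; call [edx+0x14]", but inspects the slot first and
// returns false instead of jumping to address 0.
bool checked_dispatch(Object* obj, int& result) {
    Slot target = obj->vptr[kSlotIndex];
    if (target == nullptr) {
        return false;  // the real code would crash here with pc = 0
    }
    result = target(obj);
    return true;
}
```

Calling checked_dispatch on an object whose vptr is kGoodTable.data() runs get_value; with kZeroedTable.data() it reports failure, which is the situation the backtrace with frame #0 at 0x00000000 suggests.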

See also "gdb can't access memory address error" for the case (in one of the comments) where a rogue unmap unmapped the memory for the stacks of a few other threads, and the crash left a core dump that was pretty difficult to use.

Related

segmentation fault after linking c++ file with asm file [duplicate]

I am currently learning x86 assembly. Something is not clear to me still however when using the stack for function calls. I understand that the call instruction will involve pushing the return address on the stack and then load the program counter with the address of the function to call. The ret instruction will load this address back to the program counter.
My confusion is, does it matter when the ret instruction is called within the procedure/function? Will it always find the correct return address stored on the stack, or must the stack pointer be currently pointing to where the return address was stored? If that's the case, can't we just use push and pop instead of call and ret?
For example, the code below could be the first thing on entering the function. If we push different registers on the stack, must the ret instruction only be executed after the registers are popped in reverse order, so that after the pop %ebp instruction the stack pointer points to the place on the stack where the return address is? Or will ret still find it regardless of where it is executed? Thanks in advance.
push %ebp
mov %esp, %ebp    # AT&T operand order (source, destination): ebp = esp
# push other registers
...
# pop other registers
mov %ebp, %esp    # esp = ebp
# (could a ret instruction go here, for example, and still pop the correct return address?)
pop %ebp
ret
You must leave the stack and non-volatile registers as you found them. The calling function has no clue what you might have done with them otherwise - the calling function will simply continue to its next instruction after ret. Only ret after you're done cleaning up.
ret will always look to the top of the stack for its return address and will pop it into EIP. If the ret is a "far" return then it will also pop the code segment into the CS register (which would also have been pushed by call for a "far" call). Since these are the first things pushed by call, they must be the last things popped by ret. Otherwise you'll end up returning to somewhere undefined.
The CPU has no idea what a function is. The ret instruction fetches the value from the memory pointed to by esp and jumps there. For example, you can do things like this (to illustrate that the CPU is not interested in how you structurally organize your source code):
; slow alternative to "jmp continue_there_address"
push continue_there_address
ret
continue_there_address:
...
Also, you don't need to restore the registers from the stack (or even restore them to the original registers); as long as esp points to the return address when ret is executed, that address will be used:
call SomeFunction
...
SomeFunction:
push eax
push ebx
push ecx
add esp,8 ; forget about last 2 push
pop ecx ; ecx = original eax
ret ; returns back after call
If your function should be callable from other parts of the code, you will still want to save/restore the registers as required by the calling convention of the platform you are programming for, so that from the caller's point of view you do not modify a register value which should be preserved. But none of that matters to the CPU when it executes ret: it just loads the value from the stack ([esp]) and jumps there.
Also, once the return address is stored on the stack, it does not differ in any way from other values pushed there; all of them are just values written in memory. So ret has no way to find the "return address" in the stack and skip other values: to the CPU, each 32-bit value in memory is just that, a 32-bit value. Whether it was stored by call, push, mov, or something else doesn't matter; the origin of a value is not recorded, only the value.
If that's the case, can't we just use push and pop instead of call and ret?
You can certainly push a preferred return address onto the stack (my first example). But you can't do pop eip; there's no such instruction. That is effectively what ret does, but no x86 assembly programmer uses such a mnemonic, and the opcode differs from the other pop instructions. You can of course pop the return address into a different register, like eax, and then do jmp eax, to get a slow alternative to ret (which also modifies eax).
That said, complex modern x86 CPUs do keep some track of call/ret pairings (to predict where the next ret will return, so they can prefetch the code ahead of time). If you use one of these non-standard alternatives, at some point the CPU will realize its return-address prediction is out of step with the real state, and it will have to discard the speculatively fetched work and re-fetch from the real eip value, so you may pay a performance penalty for confusing it.
In the example code, if the return was done before pop %ebp, it would attempt to return to the "address" that was in ebp at the start of the function, which would be the wrong address to return to.
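The point that ret simply pops whatever is on top of the stack can be replayed in a toy model. The sketch below is not an emulator; ToyCpu, run_some_function and run_early_ret are hypothetical names, the stack is a plain vector of 32-bit words, and the addresses used with it are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a 32-bit stack: just enough to show what call and ret do.
struct ToyCpu {
    std::vector<std::uint32_t> stack;  // back() is the top of the stack ([esp])
    std::uint32_t eip = 0;

    void push(std::uint32_t v) { stack.push_back(v); }
    std::uint32_t pop() {
        std::uint32_t v = stack.back();
        stack.pop_back();
        return v;
    }
    // Models "add esp, 4*words": drop entries without reading them.
    void discard(std::size_t words) { stack.resize(stack.size() - words); }

    // call: push the return address, then jump to the target.
    void call(std::uint32_t return_address, std::uint32_t target) {
        push(return_address);
        eip = target;
    }
    // ret: pop whatever is on top into eip. Nothing checks that the value
    // was put there by a call.
    void ret() { eip = pop(); }
};

// Replays the body of SomeFunction from the answer; returns the final ecx.
std::uint32_t run_some_function(ToyCpu& cpu, std::uint32_t eax,
                                std::uint32_t ebx, std::uint32_t ecx) {
    cpu.push(eax);
    cpu.push(ebx);
    cpu.push(ecx);
    cpu.discard(2);   // add esp,8 : forget the last two pushes
    ecx = cpu.pop();  // ecx = original eax
    cpu.ret();        // the return address is on top again
    return ecx;
}

// Replays the question's prologue and then returns before "pop ebp".
void run_early_ret(ToyCpu& cpu, std::uint32_t ebp) {
    cpu.push(ebp);  // push ebp
    cpu.ret();      // pops the saved ebp, not the return address
}
```

After cpu.call(0x1000, 0x2000), run_some_function leaves eip at 0x1000 with an empty stack, while run_early_ret leaves eip holding the saved ebp value and the real return address still sitting on the stack.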

Why isn't a call command at each frame of disassemble except the last

I'm analyzing a core post-mortem via disassemble output of gdb. I'm new to this, so I'm still growing in my understanding of what I'm looking at. One immediate confusion for me is that as I go between frames and look at disassemble output, I don't see callq commands as the command being run as I would expect for all the non-frame 0 frames. Shouldn't each frame leading up to frame 0 be calling a function?
(gdb) f 0
(gdb) disassemble
...
=> 0x0000000001b0af10 <+16>: mov (%rdi),%rdx
...
End of assembler dump.
(gdb) info registers rdi
rdi 0x0 0
Makes sense: the crash happened due to a null pointer dereference. Now let's go up a frame and see the disassemble output there:
(gdb) up
(gdb) disassemble
...
=> 0x0000000001b1c01b <+315>: test %al,%al
...
What? The frame above is running test? Shouldn't it be calling the function disassembled in frame 0? What am I misunderstanding?
This is x64 assembly generated from GCC 4.8 compiling C++ code.
What am I misunderstanding?
On x86 (and x86_64), the CALL instruction pushes the address of the next instruction onto stack, and jumps to the called function.
When you go up, the current instruction is the one that will be executed after the frame you just stepped up from returns.
Do x/i $pc-5 if you want to see the actual CALL (note: the -5 works for most, but not all CALLs. See Peter Cordes comment below).
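The -5 comes from the most common encoding of a direct near CALL: one opcode byte (0xE8) followed by a 32-bit displacement relative to the return address. The sketch below checks whether the five bytes before a return address look like that form. decode_call_rel32 is a hypothetical helper, not a gdb feature, and it can give false positives, since 0xE8 may also occur as the tail of some other instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// `code` holds raw bytes starting at `base_address`. If the 5 bytes ending at
// `return_address` are "E8 rel32", store the call target and return true.
bool decode_call_rel32(const std::vector<std::uint8_t>& code,
                       std::uint64_t base_address,
                       std::uint64_t return_address,
                       std::uint64_t& target) {
    if (return_address < base_address + 5) return false;
    const std::uint64_t end = return_address - base_address;
    if (end > code.size()) return false;
    const std::size_t start = static_cast<std::size_t>(end) - 5;
    if (code[start] != 0xE8) return false;  // not a direct near CALL

    // Assemble the little-endian 32-bit displacement.
    std::uint32_t raw = 0;
    for (int i = 0; i < 4; ++i) {
        raw |= static_cast<std::uint32_t>(code[start + 1 + i]) << (8 * i);
    }
    // The displacement is signed (two's complement) and relative to the
    // return address, i.e. the address of the next instruction.
    const std::int64_t disp = static_cast<std::int32_t>(raw);
    target = return_address + static_cast<std::uint64_t>(disp);
    return true;
}
```

As an illustration with assumed byte values: a call at 0x7aa that returns to 0x7af and targets 0x7e4 would be encoded as E8 35 00 00 00, and the helper recovers 0x7e4 from it. An indirect call through a register plus an 8-bit offset is only three bytes long (for example FF 52 14), so looking five bytes back lands in the middle of the previous instruction, which is why the -5 rule does not cover every CALL.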

Array index out of bound, but gdb reports the wrong line - why?

I am a C++ beginner. I found a strange phenomenon. GDB can not give the line number of the root cause of error in this code.
#include <array>
using std::array;
int main(int argc, char **argv) {
array<double, 3> edgePoint1{0, 0, 0};
array<double, 3> edgePoint2{0, 0, 0};
array<double, 3> edgePoint3{0, 0, 0};
array<array<double, 3>, 3> edgePoints{};
edgePoints[0] = edgePoint1;
edgePoints[1] = edgePoint2;
edgePoints[3] = edgePoint3;
return 0;
}
Line 13 is the root of the problem. But when I use 'bt' in GDB, it prints line 15. Why?
Program received signal SIGABRT, Aborted.
0x00007f51f3133d7f in raise () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007f51f3133d7f in raise () from /usr/lib/libc.so.6
#1 0x00007f51f311e672 in abort () from /usr/lib/libc.so.6
#2 0x00007f51f3176878 in __libc_message () from /usr/lib/libc.so.6
#3 0x00007f51f3209415 in __fortify_fail_abort () from /usr/lib/libc.so.6
#4 0x00007f51f32093c6 in __stack_chk_fail () from /usr/lib/libc.so.6
#5 0x0000556e72f282b1 in main (argc=1, argv=0x7ffdc9299218) at /home/wzx/CLionProjects/work/test1.cpp:15
#6 0x0000000000000000 in ?? ()
The debugger diagnoses practical errors: things that happen as a result of the mistakes/bugs in your code, after the very complicated process of translating source code into an actual program that the computer can run. It does not analyse C++ source for the mistakes/bugs in your code, nor is it theoretically capable of doing so (at least not in the general case). Here the practical error is that your buffer overrun corrupted the "stack". You only see that symptom reported, not the original cause (the buffer overrun itself).
It's a little like how if you accidentally steer your car off the road and smash it into a tree, the police know that you smashed your car into the tree, but they don't automatically know that this is because you had a stroke at the wheel, or because you were texting, or because you were drunk. They have to investigate to find out these details after the fact, using other (more indirect) pieces of evidence, such as interviewing you or performing a medical examination.
(Notice that the phone flew through the broken window and landed on the ground near the tree: it's nowhere near the driver's hand — even though the cause of the crash was that it was in the driver's hand. A good policeman will realise that the phone probably used to be inside the car, and based on the half-written text message displayed on its screen it was probably in the driver's hand at the time of the crash. Case closed, your honour. Solution: stop texting while driving.)
This is a fact of life with C++, which is why we need to pay careful attention to our code when writing it, so that we don't "shoot ourselves in the foot". Here you were very fortunate to get a crash at all, otherwise you may have missed the bug entirely and instead just seen unexpected/weird behaviours!
Over time, as you gain experience, you will become more accustomed to this, and get skilled at looking "around" or "near" the reported line to see what logical error led to the practical problem. It's mental pattern matching, for the most part. It's also kind of why you can't "learn C++ in 21 days"!
Some tools do exist to make this easier. Static Analysis tools can look at your code and sometimes spot when you've used an impossible array index. Containers (e.g. array and vector) can be implemented with additional bounds checking (for at() this is required; for op[] some implementations add it for convenience in debug mode). Combine tooling with experience for great success!
While what Lightness Races in Orbit said is true, it is also true that when you compile with debug info (i.e. with the -g option in gcc/clang) the compiler emits line information that in theory allows a debugger to associate each machine instruction with a source line number, even when compiling with -O3, where really fancy optimizations occur.
That said, the explanation of why gdb tells you the program crashed on line 15 is simple: the crash really did not happen on line 13. It's enough to look at the stack backtrace [I compiled your program and ran it under gdb on Linux]:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff7a24801 in __GI_abort () at abort.c:79
#2 0x00007ffff7a6d897 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7ffff7b9a988 "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff7b18cd1 in __GI___fortify_fail_abort (need_backtrace=need_backtrace@entry=false,
msg=msg@entry=0x7ffff7b9a966 "stack smashing detected") at fortify_fail.c:33
#4 0x00007ffff7b18c92 in __stack_chk_fail () at stack_chk_fail.c:29
#5 0x00005555555547e2 in main (argc=1, argv=0x7fffffffdde8) at crash.cpp:15
As you can see in frame #4, your program did not crash because of the buffer overflow itself but because of the compiler's stack protector (function __stack_chk_fail).
Since that is not code written by you, but automatically emitted by the compiler precisely in order to detect such bugs, the line information cannot be real. The compiler just used line 15, because it is where your main() function ends and, of course, the place where, if you look at the disassembly code, the compiler emitted code using stack sentinels to detect stack corruption.
In order to see the whole picture even better, here is the disassembly code (just use disass /s main in gdb to see it):
13 edgePoints[3] = edgePoint3;
0x000000000000079e <+308>: lea rax,[rbp-0x50]
0x00000000000007a2 <+312>: mov esi,0x3
0x00000000000007a7 <+317>: mov rdi,rax
0x00000000000007aa <+320>: call 0x7e4 <std::array<std::array<double, 3ul>, 3ul>::operator[](unsigned long)>
0x00000000000007af <+325>: mov rcx,rax
0x00000000000007b2 <+328>: mov rax,QWORD PTR [rbp-0x70]
0x00000000000007b6 <+332>: mov rdx,QWORD PTR [rbp-0x68]
0x00000000000007ba <+336>: mov QWORD PTR [rcx],rax
0x00000000000007bd <+339>: mov QWORD PTR [rcx+0x8],rdx
0x00000000000007c1 <+343>: mov rax,QWORD PTR [rbp-0x60]
0x00000000000007c5 <+347>: mov QWORD PTR [rcx+0x10],rax
14 return 0;
0x00000000000007c9 <+351>: mov eax,0x0
15 }
0x00000000000007ce <+356>: mov rdx,QWORD PTR [rbp-0x8]
0x00000000000007d2 <+360>: xor rdx,QWORD PTR fs:0x28
0x00000000000007db <+369>: je 0x7e2 <main(int, char**)+376>
0x00000000000007dd <+371>: call 0x540 <__stack_chk_fail@plt>
0x00000000000007e2 <+376>: leave
As you can see, there are several instructions at line 15, clearly emitted by the compiler because the stack protector is enabled by default.
If you compile your program with -fno-stack-protector, it won't crash, [at least it doesn't on my machine with my compiler] but the actual stack corruption will be there, just producing unpredictable effects. In a bigger program, when stack corruption occurs, you could expect any kind of weird behavior much later than the moment when the corruption occurred. In other words, the stack protector is a very good thing and helps you by exposing the problem instead of hiding it, which is what would naturally happen without it.
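Why the report lands on the closing brace rather than on the offending store can be shown with a toy model of a stack sentinel. This is only a sketch: ToyFrame, kCanary and epilogue_detects_smash are hypothetical, the sentinel value is an arbitrary constant rather than the per-process value the real protector reads, and the "frame" is a single std::array so that the overlapping write is well-defined here, unlike the real out-of-bounds write.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Arbitrary illustrative sentinel; not the value a real stack protector uses.
constexpr std::uint64_t kCanary = 0xC0FFEE1234567890ULL;

// Toy "stack frame": slots 0..2 model the local array, slot 3 the sentinel
// that sits right after it.
struct ToyFrame {
    std::array<std::uint64_t, 4> slots{0, 0, 0, kCanary};

    // An unchecked store, like operator[] on the local array.
    // Precondition for this model: index < 4.
    void store_local(std::size_t index, std::uint64_t value) {
        slots[index] = value;
    }
    bool canary_intact() const { return slots[3] == kCanary; }
};

// Runs a "function body" that stores to `index` and checks the sentinel only
// in its epilogue. Returns true if the epilogue check would abort.
bool epilogue_detects_smash(std::size_t index) {
    ToyFrame frame;
    frame.store_local(index, 0);  // "line 13": the store itself raises nothing
    // ... any number of further statements could run here ...
    return !frame.canary_intact();  // "line 15": the only place the check runs
}
```

With index 2 the sentinel survives and nothing is reported; with index 3 the store succeeds silently and the problem only surfaces at the final check, which is the pattern visible in the backtrace above.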
The issue is in line:
edgePoints[3] = edgePoint3;
Probably a typo.
gdb cannot stop earlier, because the offending line executes as an ordinary store; only the index is wrong. The failure happens after that line has executed, when some code that checks the state of the stack is triggered (this depends on your compiler flags, which you didn't give).
At this stage, undefined behavior (the mentioned line) has already wreaked havoc and anything can happen. Thanks to the checks, at least you see that you have a stack problem.
You can add more checks for out of bound access on some compilers or with address sanitizers. They would have flagged the error without relying on gdb.
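As one example of such a check, std::array::at() is required to throw std::out_of_range for a bad index, so the error is reported at the faulty statement itself. A minimal sketch, where store_checked and Point are hypothetical names introduced only for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

using Point = std::array<double, 3>;

// Stores `p` at `index` through at(), which throws std::out_of_range instead
// of writing past the end of the array. Returns true if the store happened.
bool store_checked(std::array<Point, 3>& points, std::size_t index,
                   const Point& p) {
    try {
        points.at(index) = p;
        return true;
    } catch (const std::out_of_range&) {
        return false;  // the bad index is caught right where it is used
    }
}
```

In the question's program, replacing edgePoints[3] with edgePoints.at(3) would raise the exception on that line. Building with -fsanitize=address in GCC or Clang is another option that keeps the operator[] syntax, assuming the toolchain supports it.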

Why would fclose hang / deadlock? (Windows)

I have a directory change monitor process that reads updates from files within a set of directories. I have another process that performs small writes to a lot of files to those directories (test program). Figure about 100 directories with 10 files in each, and about 500 files being modified per second.
After running for a while, the directory monitor process hangs on a call to fclose() in a method that is basically tailing the file. In this method, I fopen() the file, check that the handle is valid, do a few seeks and reads, and then call fclose(). These reads are all performed by the same thread in the process. After the hang, the thread never progresses.
I couldn't find any good information on why fclose() might deadlock instead of returning some kind of error code. The documentation does mention _fclose_nolock(), but it doesn't seem to be available to me (Visual Studio 2003).
The hang occurs for both debug and release builds. In a debug build, I can see that fclose() calls _free_base(), which hangs before returning. Some kind of call into kernel32.dll => ntdll.dll => KernelBase.dll => ntdll.dll is spinning. Here's the assembly from ntdll.dll that loops indefinitely:
77CEB83F cmp dword ptr [edi+4Ch],0
77CEB843 lea esi,[ebx-8]
77CEB846 je 77CEB85E
77CEB848 mov eax,dword ptr [edi+50h]
77CEB84B xor dword ptr [esi],eax
77CEB84D mov al,byte ptr [esi+2]
77CEB850 xor al,byte ptr [esi+1]
77CEB853 xor al,byte ptr [esi]
77CEB855 cmp byte ptr [esi+3],al
77CEB858 jne 77D19A0B
77CEB85E mov eax,200h
77CEB863 cmp word ptr [esi],ax
77CEB866 ja 77CEB815
77CEB868 cmp dword ptr [edi+4Ch],0
77CEB86C je 77CEB87E
77CEB86E mov al,byte ptr [esi+2]
77CEB871 xor al,byte ptr [esi+1]
77CEB874 xor al,byte ptr [esi]
77CEB876 mov byte ptr [esi+3],al
77CEB879 mov eax,dword ptr [edi+50h]
77CEB87C xor dword ptr [esi],eax
77CEB87E mov ebx,dword ptr [ebx+4]
77CEB881 lea eax,[edi+0C4h]
77CEB887 cmp ebx,eax
77CEB889 jne 77CEB83F
Any ideas what might be happening here?
I posted this as a comment, but I realize this could be an answer in its own right...
Based on the disassembly, my guess is you've overwritten some internal heap structure maintained by ntdll, and it is looping forever iterating through a linked list.
In particular, at the start of the loop the current list node seems to be in ebx. At the end of the loop, the expected last node (or terminator, if you like; it looks a bit like these are circular lists whose last node is the same as the first, that node being located at edi+0C4h) is held in eax. Probably the cmp ebx, eax never finds them equal, because a heap corruption has introduced a cycle into the list.
I don't think this has anything to do with locks, otherwise we would see some atomic instructions (eg. lock cmpxchg, xchg, etc.) or calls to other synchronization functions.
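The failure mode described here, a walk over a circular list that never gets back to its head, can be reproduced in a few lines. The sketch below is a toy and says nothing about how ntdll actually lays out its heap lists: nodes are indices into a vector, node 0 plays the head, and walk_list and the Walk enum are hypothetical names. Unlike the loop in the disassembly, it gives up after as many steps as there are nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Walk { ReachedHead, CycleWithoutHead };

// next[i] is the index of the node after node i; node 0 is the list head.
// Preconditions for this model: next is non-empty and every entry is a
// valid index.
// A well-formed circular list of n nodes gets back to the head within n
// steps. If it has not by then, the walk is trapped in a cycle that no
// longer contains the head, so an unbounded loop would spin forever.
Walk walk_list(const std::vector<std::size_t>& next) {
    std::size_t node = next[0];
    for (std::size_t steps = 0; steps < next.size(); ++steps) {
        if (node == 0) return Walk::ReachedHead;
        node = next[node];
    }
    return Walk::CycleWithoutHead;
}
```

For the intact list {1, 2, 3, 0} the walk returns to the head; if the last link is overwritten so that node 3 points back at node 2, giving {1, 2, 3, 2}, the head is never seen again.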
I had the same problem with a file close function. In my case, I solved it by calling the close function directly in the body of the calling function instead of wrapping it in a function of its own.
I was also suspicious of:
(1) the file name being duplicated (2) Windows scheduling (file I/O wasn't completed before the next task/thread was started. Windows scheduling and multi-threading happen behind the curtain, so it is hard to verify, but I had a similar issue when I tried to save a lot of data as ASCII in a loop. Saving as binary solved it in that case.)
My environment, IDE: Visual Studio 2015, OS: Windows 7, language: C++

Doubting about the Threads window of visual studio

As you can see above, there are 4 Win32 threads at exactly the same location. How should I understand this?
UPDATE
7C92E4BE mov dword ptr [esp],eax
7C92E4C1 mov dword ptr [esp+4],0
7C92E4C9 mov dword ptr [esp+8],0
7C92E4D1 mov dword ptr [esp+10h],0
7C92E4D9 push esp
7C92E4DA call 7C92E508
7C92E4DF mov eax,dword ptr [esp]
7C92E4E2 mov esp,ebp
7C92E4E4 pop ebp
7C92E4E5 ret
7C92E4E6 lea esp,[esp]
7C92E4ED lea ecx,[ecx]
7C92E4F0 mov edx,esp
7C92E4F2 sysenter
7C92E4F4 ret
At a guess, they're probably sleeping in something like WaitForSingleObject or similar.
The debugger shows the next ring3 processor instruction that is going to be executed. In this case the thread has executed sysenter, which makes a ring0 system call into the operating system's kernel. This kernel system call is waiting for something to happen before returning control to the calling code. Once that something happens, the thread will resume at the next user-mode instruction, which in this case is ret.
If you have 4 threads that are all calling the same function that waits for a system call at the same location, you will have 4 threads that show the same address in the Threads window. This is something that you will see quite often in applications built with the Windows subsystem, which usually have a number of threads that are started by the Windows API that spend most of their time waiting for kernel events.
At a guess, you have a thread pool of some sort, so you have four threads all executing the same thread function. In this case, all four are most likely idle, waiting for a task they need to execute. If that's the case, it's quite sensible that all four show the same location.
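A minimal sketch of such a pool, written with the portable C++ standard library rather than the Win32 API the original program presumably uses, shows why idle workers line up at one address: each of them is blocked inside the same wait call. TinyPool and sum_with_pool are hypothetical names, and the pool size of four is chosen only to mirror the screenshot described in the question.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class TinyPool {
public:
    explicit TinyPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i) {
            workers_.emplace_back([this] { worker_loop(); });
        }
    }
    ~TinyPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();  // workers drain the queue first
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                // Every idle worker blocks here, so a debugger shows them
                // all stopped at the same location.
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run the task outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;  // declared last: starts after the rest
};

// Usage example: add up 1..n on four workers.
int sum_with_pool(int n) {
    std::atomic<int> sum{0};
    {
        TinyPool pool(4);
        for (int i = 1; i <= n; ++i) {
            pool.submit([&sum, i] { sum += i; });
        }
    }  // the destructor waits for all submitted tasks to finish
    return sum.load();
}
```

Pausing a program like this while the queue is empty would show all four workers inside the condition-variable wait, one level above the system call that actually blocks them.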
You'll need to ignore the threads that are started by Microsoft code. I'm guessing at mmsys or DirectX from your screen shot. Microsoft code is very thread-happy.
You can get better diagnostics about what they do when you enable the Microsoft Symbol Server. You'll get decent names in the Call Stack window, often letting you guess what their purpose is. Of course, you'll never get to look at their code.