Corrupted offset in call instruction - c++

The last few days have been spent debugging a very strange problem. An application built for i386 running on Windows crashed, with the top of the callstack completely corrupted and the instruction pointer in a nonsense location.
After some effort, I rebuilt the callstack and was able to determine how the IP ended up in the nonsense location. An instruction in boost shared pointer code attempts to call a function defined in my DLLs import address table using an incorrect offset. The instruction looks like:
call dword ptr [nonsense offset into import address table]
As a result, execution ended up in a bad location that was, unfortunately, executable. Execution then proceeded, gobbling up the top of the stack until eventually crashing.
By launching the identical application on my PC, and stepping into the problematic code, I can find the same call instruction and see it's supposed to be calling msvc100's 'new' operator.
Further comparing the minidump from the client's PC to my PC, I found that my PC calls a function with an offset of 0x0254 into the address table. On the client's PC, the code is trying to invoke a function with an offset of 0x8254.
What's even more confusing is that this offset is not coming from a register or another memory location. The offset is a constant in the disassembly. So, the disassembly looks like:
call dword ptr [ 0x50018254 ]
not like:
call dword ptr [ edx ]
Does anyone know how this might happen?

That's a single bit flip:
0x0254 = 0b0000001001010100
0x8254 = 0b1000001001010100
Perhaps corrupt memory, corrupt disk, gamma ray from the sun...?
If this specific case is reproducible and their on-disk binary matches yours, I'd investigate further. If it's not specifically reproducible, I'd encourage the client to run some machine diagnostics.

That seems to me a hardware error for sure, mainly a memory error. As #Hostile_Fork pointed out, it's just a bit flip.
Does your memory have the ECC feature? If it does, make sure it is enabled. I would run a burn-in memory test with memtest86 to see what happens; I bet you have a faulty memory chip. It doesn't look like a bug.
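
Added example (not code from the question or either answer): a small Python check that turns the "is it a single bit flip?" argument above into something you can run against the bytes of the suspect instruction taken from the on-disk module and from the minidump. The two 6-byte instruction encodings are constructed from the disassembly quoted in the question (`call dword ptr [imm32]` is `FF 15` plus a little-endian absolute address); the IAT base 0x50010000 is an assumption chosen so the two slot addresses land at offsets 0x0254 and 0x8254 as reported.

```python
import struct

# Constructed from the disassembly quoted in the question.
EXPECTED_INSN = bytes.fromhex("ff1554020150")   # call dword ptr [0x50010254]
OBSERVED_INSN = bytes.fromhex("ff1554820150")   # call dword ptr [0x50018254]
ASSUMED_IAT_BASE = 0x50010000                   # assumption: start of the IAT in the mapped DLL


def decode_call_mem32(insn: bytes) -> int:
    """Return the absolute address of the pointer slot loaded by `FF 15 imm32`."""
    if len(insn) < 6 or insn[0] != 0xFF or insn[1] != 0x15:
        raise ValueError("not a `call dword ptr [imm32]` encoding")
    return struct.unpack_from("<I", insn, 2)[0]


def bit_flips(expected: bytes, observed: bytes):
    """List every (byte_offset, bit_index) at which the two byte strings differ."""
    if len(expected) != len(observed):
        raise ValueError("byte strings must have the same length")
    flips = []
    for offset, (a, b) in enumerate(zip(expected, observed)):
        diff = a ^ b
        for bit in range(8):
            if diff >> bit & 1:
                flips.append((offset, bit))
    return flips


def classify(expected: bytes, observed: bytes) -> str:
    """Hamming distance 1 is the hardware-fault signature; many bits point elsewhere."""
    n = len(bit_flips(expected, observed))
    if n == 0:
        return "identical"
    if n == 1:
        return "single-bit flip"
    return f"{n} bits differ"


def analyse_call(expected: bytes, observed: bytes, iat_base: int) -> dict:
    """Tie each flipped bit back to the imm32 operand and to the IAT offsets the question quotes."""
    flips = bit_flips(expected, observed)
    # bytes 2..5 are the operand; bit 15 of the operand is the 0x0254 -> 0x8254 change
    operand_bits = [(off - 2) * 8 + bit for off, bit in flips if off >= 2]
    return {
        "expected_slot": decode_call_mem32(expected),
        "observed_slot": decode_call_mem32(observed),
        "expected_iat_offset": decode_call_mem32(expected) - iat_base,
        "observed_iat_offset": decode_call_mem32(observed) - iat_base,
        "flips": flips,
        "operand_bits": operand_bits,
        "verdict": classify(expected, observed),
    }


if __name__ == "__main__":
    print(analyse_call(EXPECTED_INSN, OBSERVED_INSN, ASSUMED_IAT_BASE))
```

The same `bit_flips` routine can be run over a whole code section dumped from the client's process against the matching bytes of the module on disk: a single differing bit across the section supports the memory-fault theory, while a cluster of differing bytes points at a software overwrite instead.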

Related

Why does pre_c_init access memory outside of the defined program segments?

While looking through the assembly for a console "hello world" program (compiled using the visual c++ compiler), I came across this:
pre_c_init proc near
.text:00401AFE mov eax, 5A4Dh
.text:00401B03 cmp ds:400000h, ax
The code above seems to be accessing memory that isn't filled with anything in particular: All segments start at 0x401000 or even further down in the file. (The image base is at 0x400000, but the first segment is at 0x401000).
I used OllyDbg to see what the actual value at 0x400000 is, and every single time it's the same as in the code (0x5A4D). What's going on here?
5A4D is "MZ" in little-endian ASCII, and MZ is the signature of MS-DOS and, more recently, PE executables.
The comparison checks whether the executable has been mapped at the default base address, 0x400000. This, I believe, is used to determine whether it is necessary to perform relocation.
This is discussed further in the following thread: Why does PE need a relocation table?
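
Added example (not code from the question or the answer): a Python parser for the headers that sit at the image base, showing where the 'MZ' word pre_c_init compares comes from and which header fields the loader looks at when the image cannot be mapped at its preferred base. The input is a constructed minimal PE32 header (DOS header, PE signature, file header, optional header, no sections) built by `build_minimal_pe32_header`; `make_word_reader` is a stand-in for reading process memory so the pre_c_init comparison can be run with the image mapped at, and away from, its preferred base.

```python
import struct

IMAGE_DOS_SIGNATURE = 0x5A4D                 # "MZ"
IMAGE_NT_SIGNATURE = 0x00004550              # "PE\0\0"
PE32_MAGIC, PE32PLUS_MAGIC = 0x10B, 0x20B
IMAGE_FILE_RELOCS_STRIPPED = 0x0001
IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040
IMAGE_DIRECTORY_ENTRY_BASERELOC = 5


def parse_pe_headers(image: bytes) -> dict:
    """Walk DOS header -> e_lfanew -> NT headers and pull out the fields the loader
    consults when deciding where the image goes and whether it can be moved."""
    if struct.unpack_from("<H", image, 0)[0] != IMAGE_DOS_SIGNATURE:
        raise ValueError("no MZ signature at offset 0")
    e_lfanew = struct.unpack_from("<I", image, 0x3C)[0]
    if struct.unpack_from("<I", image, e_lfanew)[0] != IMAGE_NT_SIGNATURE:
        raise ValueError("no PE signature at e_lfanew")
    fh = e_lfanew + 4                                    # IMAGE_FILE_HEADER
    machine, _nsec, _ts, _sym, _nsym, _optsz, chars = struct.unpack_from("<HHIIIHH", image, fh)
    opt = fh + 20                                        # IMAGE_OPTIONAL_HEADER
    magic = struct.unpack_from("<H", image, opt)[0]
    if magic == PE32_MAGIC:
        image_base = struct.unpack_from("<I", image, opt + 28)[0]
        ndirs_off, dirs_off = opt + 92, opt + 96
    elif magic == PE32PLUS_MAGIC:
        image_base = struct.unpack_from("<Q", image, opt + 24)[0]
        ndirs_off, dirs_off = opt + 108, opt + 112
    else:
        raise ValueError(f"unknown optional header magic {magic:#x}")
    entry_rva = struct.unpack_from("<I", image, opt + 16)[0]
    dll_chars = struct.unpack_from("<H", image, opt + 70)[0]
    ndirs = struct.unpack_from("<I", image, ndirs_off)[0]
    reloc_rva = reloc_size = 0
    if ndirs > IMAGE_DIRECTORY_ENTRY_BASERELOC:
        reloc_rva, reloc_size = struct.unpack_from(
            "<II", image, dirs_off + 8 * IMAGE_DIRECTORY_ENTRY_BASERELOC)
    return {
        "machine": machine,
        "magic": magic,
        "image_base": image_base,
        "entry_point": image_base + entry_rva,
        "reloc_dir": (reloc_rva, reloc_size),
        "can_be_rebased": reloc_size > 0 and not (chars & IMAGE_FILE_RELOCS_STRIPPED),
        "aslr_opt_in": bool(dll_chars & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE),
    }


def relocation_required(hdr: dict, actual_base: int) -> bool:
    """True when the image was mapped somewhere other than its link-time ImageBase."""
    return actual_base != hdr["image_base"]


def make_word_reader(mappings: dict):
    """Stand-in for a process memory read: `mappings` is {load_address: bytes}.
    Returns None for an unmapped address instead of faulting."""
    def read_word(addr):
        for base, data in mappings.items():
            if base <= addr and addr + 2 <= base + len(data):
                return struct.unpack_from("<H", data, addr - base)[0]
        return None
    return read_word


def mapped_at_preferred_base(read_word, preferred_base: int) -> bool:
    """The same test pre_c_init performs: `cmp ds:<ImageBase>, 'MZ'`."""
    return read_word(preferred_base) == IMAGE_DOS_SIGNATURE


def build_minimal_pe32_header(image_base=0x400000, entry_rva=0x1AFE,
                              reloc=(0, 0), dll_chars=0) -> bytes:
    """Constructed test input: DOS header, PE signature, file header and a PE32
    optional header with only the fields used above filled in (no sections)."""
    dos = bytearray(64)
    struct.pack_into("<H", dos, 0, IMAGE_DOS_SIGNATURE)
    struct.pack_into("<I", dos, 0x3C, 64)                      # e_lfanew
    file_hdr = struct.pack("<HHIIIHH", 0x14C, 1, 0, 0, 0, 224, 0x0102)   # i386, 1 section
    opt = bytearray(224)
    struct.pack_into("<H", opt, 0, PE32_MAGIC)
    struct.pack_into("<I", opt, 16, entry_rva)                 # AddressOfEntryPoint
    struct.pack_into("<I", opt, 28, image_base)                # ImageBase
    struct.pack_into("<I", opt, 32, 0x1000)                    # SectionAlignment
    struct.pack_into("<I", opt, 36, 0x200)                     # FileAlignment
    struct.pack_into("<H", opt, 68, 3)                         # Subsystem: console
    struct.pack_into("<H", opt, 70, dll_chars)                 # DllCharacteristics
    struct.pack_into("<I", opt, 92, 16)                        # NumberOfRvaAndSizes
    struct.pack_into("<II", opt, 96 + 8 * IMAGE_DIRECTORY_ENTRY_BASERELOC, *reloc)
    return bytes(dos) + b"PE\0\0" + file_hdr + bytes(opt)


if __name__ == "__main__":
    hdr_bytes = build_minimal_pe32_header()
    print(parse_pe_headers(hdr_bytes))
    print(mapped_at_preferred_base(make_word_reader({0x400000: hdr_bytes}), 0x400000))
```

Note the two separate questions the headers answer: `relocation_required` says whether the image landed away from ImageBase, and `can_be_rebased` says whether a relocation table exists to fix the absolute addresses up if it did. An image without relocations that gets mapped elsewhere cannot be patched by the loader, which is why the DYNAMIC_BASE (ASLR) flag is only useful together with a relocation directory.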

stack traces stop at the leaf register (lr)

Often I see ARM stack traces (read: Android NDK stack traces) that terminate with an lr pointer, like so:
#00 pc 001c6c20 /data/data/com.audia.dev.qt/lib/libQtGui.so
#01 lr 80a356cc /data/data/com.audia.dev.rta/lib/librta.so
I know that lr stands for link register on ARM and other architectures, and that it's a quick way to store a return address, but I don't understand why it always seems to store a useless address. In this example, 80a356cc cannot be mapped to any code using addr2line or gdb.
Is there any way to get more information? Why must the trace stop at the lr address anyway?
Stumbled on the answer finally. I just had to be more observant. Look at the following short stack trace and the information that comes after it:
#00 pc 000099d6 /system/lib/libandroid.so
#01 lr 80b6c17c /data/data/com.audia.dev.rta/lib/librta.so
code around pc:
a9d899b4 bf00bd0e 2102b507 aa016d43 28004798
a9d899c4 9801bfa8 bf00bd0e 460eb573 93004615
a9d899d4 6d842105 462b4632 200047a0 bf00bd7c
a9d899e4 b100b510 f7fe3808 2800edf4 f04fbf14
a9d899f4 200030ff bf00bd10 b097b5f0 4614af01
code around lr:
80b6c15c e51b3078 e5933038 e5932024 e51b302c
80b6c16c e1a00002 e3a01000 e3a02000 ebfeee5c
80b6c17c e1a03000 e50b303c e51b303c e1a03fa3
80b6c18c e6ef3073 e3530000 0a000005 e59f34fc
80b6c19c e08f3003 e1a00003 e51b103c ebfeebe6
Now the lr address is still a 80xxxxxx address that isn't useful to us.
The address it prints from the pc is 000099d6, but look at the next section, code around pc. The first column is a list of addresses (you can tell from the fact that it increments by 16 each time). None of those addresses looks like the pc address, unless you chop off the first 16 bits. Then you'll notice that the a9d899d4 must correspond to 000099d4, and the code where the program stopped is two bytes in from that.
Android's stack trace seems to have "chopped off" the first 2 bytes of the pc address for me, but for whatever reason it does not do it for addresses in the leaf register. Which brings us to the solution:
In short, I was able to chop off the first 16 bits from the 80b6c17c address to make it 0000c17c, and so far that has given me a valid code address every time that I can look up with gdb or addr2line. (edit: I've found it's actually usually the first 12 bits or first 3 hexadecimal digits. You can decide this for yourself by looking at the stack trace output like I described above.) I can confirm that it is the right code address as well. This has definitely made debugging a great bit easier!
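
Added example (not code from the question or this answer): a Python script that automates the reasoning in the answer above. It parses the frame lines and the "code around" dumps, recovers the load base of the pc library from the dump rows, and then applies the two masks the answer reports (keep the low 16 bits, keep the low 20 bits) to the lr value. The sample trace is the one quoted in the answer, embedded verbatim as constructed input; the only assumption is that shared libraries are mapped on 4 KiB page boundaries, so the low 12 bits of a file offset survive into the runtime address.

```python
import re

# Constructed sample: the trace quoted in the answer above.
SAMPLE_TRACE = """\
#00 pc 000099d6 /system/lib/libandroid.so
#01 lr 80b6c17c /data/data/com.audia.dev.rta/lib/librta.so
code around pc:
a9d899b4 bf00bd0e 2102b507 aa016d43 28004798
a9d899c4 9801bfa8 bf00bd0e 460eb573 93004615
a9d899d4 6d842105 462b4632 200047a0 bf00bd7c
a9d899e4 b100b510 f7fe3808 2800edf4 f04fbf14
a9d899f4 200030ff bf00bd10 b097b5f0 4614af01
code around lr:
80b6c15c e51b3078 e5933038 e5932024 e51b302c
80b6c16c e1a00002 e3a01000 e3a02000 ebfeee5c
80b6c17c e1a03000 e50b303c e51b303c e1a03fa3
80b6c18c e6ef3073 e3530000 0a000005 e59f34fc
80b6c19c e08f3003 e1a00003 e51b103c ebfeebe6
"""

FRAME_RE = re.compile(r"^#(\d+)\s+(pc|lr)\s+([0-9a-f]{8})\s+(\S+)$")
BLOCK_RE = re.compile(r"^code around (pc|lr):$")
ROW_RE = re.compile(r"^([0-9a-f]{8})((?:\s+[0-9a-f]{8}){4})$")
PAGE = 0x1000            # assumption: libraries are mapped on 4 KiB page boundaries
ROW_BYTES = 16           # each dump row holds four 32-bit words


def parse_trace(text: str):
    """Split the trace into frame lines and the 'code around' dumps, keyed by register."""
    frames, blocks, current = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        m = FRAME_RE.match(line)
        if m:
            frames.append({"num": int(m.group(1)), "reg": m.group(2),
                           "addr": int(m.group(3), 16), "lib": m.group(4)})
            continue
        m = BLOCK_RE.match(line)
        if m:
            current = m.group(1)
            blocks[current] = {}
            continue
        m = ROW_RE.match(line)
        if m and current is not None:
            blocks[current][int(m.group(1), 16)] = [int(w, 16) for w in m.group(2).split()]
    return frames, blocks


def infer_load_base(file_offset: int, rows: dict):
    """Given an offset into the .so and the runtime-addressed dump rows around it,
    recover the library's load base.  A page-aligned mapping preserves the low 12
    bits, so the runtime address must sit in the row whose page matches."""
    for row_addr in rows:
        candidate = (row_addr & ~(PAGE - 1)) | (file_offset & (PAGE - 1))
        if row_addr <= candidate < row_addr + ROW_BYTES:
            return candidate - file_offset
    return None


def lr_offset_candidates(lr: int) -> dict:
    """The two masks the answer reports working: keep the low 16 or the low 20 bits."""
    return {16: lr & 0xFFFF, 20: lr & 0xFFFFF}


def consistent_bases(lr: int, rows: dict) -> dict:
    """For each lr candidate, the load base that would make the 'code around lr'
    rows line up.  If several candidates are consistent, only the binary decides."""
    return {bits: infer_load_base(off, rows) for bits, off in lr_offset_candidates(lr).items()}


if __name__ == "__main__":
    frames, blocks = parse_trace(SAMPLE_TRACE)
    pc, lr = frames[0]["addr"], frames[1]["addr"]
    print(f"pc library base: {infer_load_base(pc, blocks['pc']):#010x}")
    print(f"lr candidates:   {consistent_bases(lr, blocks['lr'])}")
```

For the pc the dump pins the base down uniquely (0xa9d80000), because the printed pc is already a file offset. For the lr both masks produce a page-aligned base that is consistent with the "code around lr" rows (0x80b60000 with 16 bits chopped, 0x80b00000 with 12 bits chopped), so the dump alone cannot choose between them; as the answer says, the candidate offset has to be checked against librta.so with addr2line or gdb, or against the process's /proc/<pid>/maps if it is still available.
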
Do you have all debugging info (-g3) on?
Gcc likes to use the lr as a normal register. Remember that a non-leaf function looks like
push {lr}
; .. setup args here etc.
bl foo ; call a function foo
; .. work with function results
pop {pc}
Once it has pushed lr to the stack, the compiler can use it almost freely - lr will be overwritten only by function calls. So it's quite likely that there is some intermediate value in lr.
This should be stated in the debugging information that the compiler generates, in order to let the debugger know it has to look at the stack value instead of lr.

trap invalid opcode rip rsp

We see a couple of the messages mentioned below in /var/log/messages for one of our applications:
Sep 18 03:24:23 <machine_name> kernel: application_name[14682] trap invalid opcode rip:f6c6e3ce rsp:ffc366bc error:0
...
Sep 18 03:19:35 <machine_name> kernel: application_name[4434] general protection rip:f6cd43a2 rsp:ffdfab0c error:7b2
I am not able to make out what this output means, or how we can track down the function / code that is causing the issue. Further, what do 'trap invalid opcode' and 'general protection' mean?
Usually that means that your program's instruction pointer points to data or garbage. That's commonly caused by writing to stray pointers and such.
One scenario would be that your code writes (through a stray pointer) over some class' virtual table, replacing the member function addresses with nonsense. The next time you call one of the class' virtual functions, your program will interpret the garbage as an address and jump to that address. If whatever data lies at this address happens not to be a valid machine code instruction for your processor, you would see this error.
There is another possibility that can cause 'invalid' op codes: that would be hardware not supporting newer opcode/instruction sets (SSE 4/5), or it not being from the right manufacturer (both AMD and Intel have some specific opcodes that work only on their processors), or just not having permission to execute certain ops (though this would probably show up as something else).
From the above I would take RIP to be 'register(?) instruction pointer' and RSP to be 'register stack pointer', in which case you could use a debugger and set an execution hardware breakpoint on the specified address (RIP) and trace back what is calling it. (It seems you're using Linux or Unix, so this is quite vague.) If you are on Windows, try using a custom exception filter to capture the EXCEPTION_ILLEGAL_INSTRUCTION event to get a little more information.
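
Added example (not code from the question or the answer): a Python parser for the two kernel log lines in the question, plus a decoder for the `error:` value. The sample log is the two lines quoted above, embedded as constructed input. The error-code layout is the Intel SDM's selector error code for #GP (bit 0 EXT, bit 1 IDT, bit 2 TI, bits 3..15 index); for the second line it shows that 0x7b2 names IDT vector 0xf6 with EXT clear, which is what the CPU reports when user-mode code executes a software interrupt (`int n`) to a gate it is not allowed to call, consistent with the answer's picture of rip landing in garbage bytes that happen to decode as instructions.

```python
import re

# Constructed sample: the two lines quoted in the question, as logged.
SAMPLE_LOG = """\
Sep 18 03:24:23 <machine_name> kernel: application_name[14682] trap invalid opcode rip:f6c6e3ce rsp:ffc366bc error:0
Sep 18 03:19:35 <machine_name> kernel: application_name[4434] general protection rip:f6cd43a2 rsp:ffdfab0c error:7b2
"""

TRAP_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]\s+(?P<trap>trap invalid opcode|general protection)\s+"
    r"rip:(?P<rip>[0-9a-f]+)\s+rsp:(?P<rsp>[0-9a-f]+)\s+error:(?P<err>[0-9a-f]+)$"
)


def parse_kernel_traps(text: str) -> list:
    """Pull timestamp, process, trap kind and the hex register/error values out of each line."""
    recs = []
    for line in text.splitlines():
        m = TRAP_RE.match(line.strip())
        if not m:
            continue
        recs.append({"ts": m["ts"], "host": m["host"], "comm": m["comm"], "pid": int(m["pid"]),
                     "trap": m["trap"], "rip": int(m["rip"], 16), "rsp": int(m["rsp"], 16),
                     "error": int(m["err"], 16)})
    return recs


def decode_selector_error_code(code: int) -> dict:
    """#GP error code layout from the Intel SDM: bit 0 EXT (caused by an external
    event), bit 1 IDT (index refers to the IDT), bit 2 TI (0 = GDT, 1 = LDT; only
    meaningful when IDT is clear), bits 3..15 the selector/vector index.
    Zero means no selector was involved in the fault."""
    if code == 0:
        return {"external": False, "table": None, "index": None}
    if code & 0b010:
        table = "IDT"
    else:
        table = "LDT" if code & 0b100 else "GDT"
    return {"external": bool(code & 1), "table": table, "index": code >> 3}


def explain(rec: dict) -> str:
    """Turn one log record into the forensic question it actually poses."""
    if rec["trap"] == "trap invalid opcode":
        return (f"pid {rec['pid']}: #UD at rip={rec['rip']:#x}; the bytes there do not decode "
                f"as an instruction this CPU accepts (data/garbage, or an unsupported ISA extension)")
    ec = decode_selector_error_code(rec["error"])
    if ec["table"] is None:
        return f"pid {rec['pid']}: #GP at rip={rec['rip']:#x} with no selector involved"
    if ec["table"] == "IDT" and not ec["external"]:
        return (f"pid {rec['pid']}: #GP at rip={rec['rip']:#x}; the code there executed a "
                f"software interrupt to vector {ec['index']:#x}, which user mode may not invoke")
    return f"pid {rec['pid']}: #GP at rip={rec['rip']:#x} referencing {ec['table']} index {ec['index']:#x}"


if __name__ == "__main__":
    for rec in parse_kernel_traps(SAMPLE_LOG):
        print(explain(rec))
```

Once the records are parsed, the next step is the one the answer describes: map each rip against the module list of the crashed process (a core dump, or /proc/<pid>/maps captured at the time) to see whether it fell inside any code mapping at all; a rip in a data or heap mapping is the stray-pointer scenario above, and a rip inside a real code mapping that still faults points at a corrupted jump target or an unsupported instruction set.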

What does the PIC register (%ebx) do?

I have written a "dangerous" program in C++ that jumps back and forth from one stack frame to another. The goal is to jump from the lowest level of a call stack to a caller, do something, and then jump back down again, each time skipping all the calls in between.
I do this by manually changing the stack base address (setting %ebp) and jumping to a label address. It totally works, with gcc and icc both, without any stack corruption at all. The day this worked was a cool day.
Now I'm taking the same program and re-writing it in C, and it doesn't work. Specifically, it doesn't work with gcc v4.0.1 (Mac OS). Once I jump to the new stack frame (with the stack base pointer set correctly), the following instructions execute, being just before a call to fprintf. The last instruction listed here crashes, dereferencing NULL:
lea 0x18b8(%ebx), %eax
mov (%eax), %eax
mov (%eax), %eax
I've done some debugging, and I've figured out that by setting the %ebx register manually when I switch stack frames (using a value I observed before leaving the function in the first place), I fix the bug. I've read that this register deals with "position independent code" in gcc.
What is position independent code? How does position independent code work? To what is this register pointing?
EBX points to the Global Offset Table. See this reference about PIC on i386. The link explains what PIC is and how EBX is used.
PIC is code that is relocated dynamically when it is loaded. Code that is non-PIC has jump and call addresses set at link time. PIC has a table that references all the places where such values exist, much like a .dll.
When the image is loaded, the loader will dynamically update those values. Other schemes reference a data value that defines a "base" and the target address is decided by performing calculations on the base. The base is usually set by the loader again.
Finally, other schemes use various trampolines that call to known relative offsets. The relative offsets contain code and/or data that are updated by a loader.
There are different reasons why different schemes are chosen. Some are fast when run, but slower to load. Some are fast to load, but have less runtime performance.

How do you make StackWalk64() work successfully on x64?

I have a C++ tool that walks the call stack at one point. In the code, it first gets a copy of the live CPU registers (via RtlCaptureContext()), then uses a few "#ifdef ..." blocks to save the CPU-specific register names into stackframe.AddrPC.Offset, ...AddrStack..., and ...AddrFrame...; also, for each of the 3 Addr... members above, it sets stackframe.Addr....Mode = AddrModeFlat. (This was borrowed from some example code I came across a while back.)
With an x86 binary, this works great. With an x64 binary, though, StackWalk64() passes back bogus addresses. (The first time the API is called, the only blatantly bogus address value appears in AddrReturn ( == 0xFFFFFFFF'FFFFFFFE -- aka StackWalk64()'s 3rd arg, the pseudo-handle returned by GetCurrentThread()). If the API is called a second time, however, all Addr... variables receive bogus addresses.) This happens regardless of how AddrFrame is set:
using either of the recommended x64 "base/frame pointer" CPU registers: rbp (= 0xf), or rdi (= 0x0)
using rsp (didn't expect it to work, but tried it anyway)
setting AddrPC and AddrStack normally, but leaving AddrFrame zeroed out (seen in other example code)
zeroing out all Addr... values, to let StackWalk64() fill them in from the passed-in CPU-register context (seen in other example code)
FWIW, the physical stack buffer's contents are also different on x64 vs. x86 (after accounting for different pointer widths & stack buffer locations, of course). Regardless of the reason, StackWalk64() should still be able to walk the call stack correctly -- heck, the debugger is still able to walk the call stack, and it appears to use StackWalk64() itself behind the scenes. The oddity there is that the (correct) call stack reported by the debugger contains base-address & return-address pointer values whose constituent bytes don't actually exist in the stack buffer (below or above the current stack pointer).
(FWIW #2: Given the stack-buffer strangeness above, I did try disabling ASLR (/dynamicbase:no) to see if it made a difference, but the binary still exhibited the same behavior.)
So. Any ideas why this would work fine on x86, but have problems on x64? Any suggestions on how to fix it?
Given that fs.sf is a STACKFRAME64 structure, you need to initialize it like this before passing it to StackWalk64: (c is a CONTEXT structure)
DWORD machine = IMAGE_FILE_MACHINE_AMD64;
RtlCaptureContext (&c);
fs.sf.AddrPC.Offset = c.Rip;
fs.sf.AddrFrame.Offset = c.Rsp;
fs.sf.AddrStack.Offset = c.Rsp;
fs.sf.AddrPC.Mode = AddrModeFlat;
fs.sf.AddrFrame.Mode = AddrModeFlat;
fs.sf.AddrStack.Mode = AddrModeFlat;
This code is taken from ACE (Adaptive Communications Environment), adapted from the StackWalker project on CodeProject.
SymInitialize(process, nullptr, TRUE) must be called (once) before StackWalk64().
FWIW, I've switched to using CaptureStackBackTrace(), and now it works just fine.