We see a couple of below mentioned messages in /var/log/messages for one of our application:
Sep 18 03:24:23 <machine_name> kernel: application_name[14682] trap invalid opcode rip:f6c6e3ce rsp:ffc366bc error:0
...
Sep 18 03:19:35 <machine_name> kernel: application_name[4434] general protection rip:f6cd43a2 rsp:ffdfab0c error:7b2
I am not able to make what’s these output means and how we can track the function / code that is causing the issue. Further what is 'trap invalid opcode' and 'general protection' means?
Usually that means that your program's instruction pointer points to data or garbage. That's commonly caused by writing to stray pointers and such.
One scenario would be that your code writes (through a stray pointer) over some class' virtual table, replacing the member function addresses with nonsense. The next time you call one of the class' virtual functions, your program will interpret the garbage as an address and jump to that address. If whatever data lies at this address happens to not to be a valid machine code instruction for your processor, you would see this error.
There is another possibility that can cause 'invalid' op codes, that would be hardware not supporting newer opcode/instruction sets(SSE 4/5) or it not being from the right manufacturer(both AMD and Intel have some specific opcodes that work only on their processors) or just not having permission to exectute certain ops(though this would probably show up as something else).
From the above I would take RIP to be 'register(?) instruction pointer' and RSP to be 'register stack pointer', in which case you could use a debugger and set an execution hardware breakpoint on the specified address(RIP) and trace back what is calling it.(it seems your using linux or unix, so this is quite vague). if you are on windows, try using a custom exception filter to capture the EXCEPTION_ILLEGAL_INSTRUCTION event to get a little more information
Related
How to handle this exception?
__asm
{
mov esp, 0
mov eax, 0
div eax
}
This is not handled with try/except or SetUnhandledExceptionFilter().
Assuming this is running in an operating system, the operating system will catch the divide by zero, and then ATTEMPT to form an exception/signal stackframe for the application code. However, since the user-mode stack is "bad", it can't.
There is really no way for the operating system to deal with this, other than kill the application. [Theoretically, the could make up a new stack from some dynamically allocated memory, but it's pretty pointless, as there is no (always working) way for the application itself to recover to a sane state].
Don't set the stack pointer to something that isn't the stack - or if you do store "random" data in the stack pointer register, do not have exceptions. It's the same as "don't aim a gun at your foot and pull the trigger, unless you want to be without your foot".
Edit:
If the code is running in "kernel mode" rather than "usermode", it's even more "game over", since it will "double-fault" - the processor hits a divide by zero exception handler, which tries to write to the stack, and when it does so, it faults. This is now a "fault within a fault handler", aka a "double-fault". The typical setup of the double-fault handler is to have a separate stack, which then recovers the fault handler. But it's still game over - we don't know how to return to the original fault handler [or how to find out what the original fault handler was].
If there is no "new stack" with the double fault handler, it will triple fault a x86 processor - typically, a triple fault will make the processor restart [technically, it halts the processor with a special combination of bits signalled on the address bus to indicate that it's a "triple fault". The typical PC northbridge then resets the processor in recognition that the triple fault is an unrecoverable situation - this is why sometimes your PC simply reboots when you have poor quality drivers].
It's not a good idea to try to interact with a higher-level language's exception mechanism from embedded assembly. The compiler can do "magic" that you cannot match, and there's no (portable) way to tell the compiler that "this assembly code might throw an exception".
I'm compiling a program on remote linux server. The program compiled. However when I run it the program ends abruptly. So I debugged the program using DDT. It spits out the following error:
Process 0:
Memory error detected in ClassName::function (filename.cpp:6462).
Thread 1 attempted to dereference a null pointer or execute an SSE instruction with an
incorrectly aligned memory address (the latter may sometimes occur spuriously if guard
pages are enabled)
Tip: Use the stack list and the local variables to explore your program's current
state and identify the source of the error.
Can anyone please tell me what exactly this error means?
The line where the program stops looks like this:
SumUtility = ParaEst[0] + hhincome * ParaEst[71] + IsBlack * ParaEst[61] + IsBachAss * (ParaEst[55]);
This is within a switch case.
These are the variable types
vector<double> ParaEst;
double hhincome;
int IsBlack, Is BachAss;
Thanks for the help!
It means that:
ParaEst is NULL or a bad Pointer
ParaEst's individual array values are not aligned to 16-byte boundaries, required for SSE.
hhincome, IsBlack, or IsBachAss are not aligned to 16-byte boundaries and are SSE type values.
SumUtility is not aligned to 16-bytes and is a SSE type field.
If you could post the assembly code of the exact line that failed along with the register values of that assembler line, we could tell you exactly which of the above conditions have failed. It would also help to see the types of each variable shown to help narrow root the cause.
Ok... The problem finally got fixed.
The issue was that the expression where the code was breaking down was in a newly defined function. However for some weird reason running the make-file did not incorporate these changes and was still compiling using the previously compiled .o file. This resulted in garbage values being assigned to the variables within this new function. To top things off the program calls this function as a first step. Hence there was this systematic breakdown. The technical aspect of this was what Michael alluded to.
After this I would always recommend to use a make clean option in the make file. The issue of why running the make file is failing to compile the modified source file is an issue that definitely warrants further discussion.
Thanks for the responses!!
The last few days have been spent debugging a very strange problem. An application built for i386 running on Windows crashed, with the top of the callstack completely corrupted and the instruction pointer in a nonsense location.
After some effort, I rebuilt the callstack and was able to determine how the IP ended up in the nonsense location. An instruction in boost shared pointer code attempts to call a function defined in my DLLs import address table using an incorrect offset. The instruction looks like:
call dword ptr [nonsense offset into import address table]
As a result, execution ended up in a bad location that was, unfortunately, executable. Execution then proceeded, gobbling up the top of the stack until eventually crashing.
By launching the identical application on my PC, and stepping into the problematic code, I can find the same call instruction and see it's supposed to by calling msvc100's 'new' operator.
Further comparing the minidump from the client's PC to my PC, I found that my PC was calls a function with an offset of 0x0254 into the address table. On the clients PC, the code is trying to invoke a function with an offset of 0x8254.
What's even more confusing is that this offset is not coming from a register or another memory location. The offset is a constant in the disassembly. So, the disassembly looks like:
call dword ptr [ 0x50018254 ]
not like:
call dword ptr [ edx ]
Does anyone know how this might happen?
That's a single bit flip:
0x0254 = 0b0000001001010100
0x8254 = 0b1000001001010100
Perhaps corrupt memory, corrupt disk, gamma ray from the sun...?
If this specific case is reproducible and their on-disk binary matches yours, I'd investigate further. If it's not specifically reproducible, I'd encourage the client to run some machine diagnostics.
Thats seems to me a hardware error for sure, mainly memory error. As #Hostile_Fork pointed out, is just a bit flip.
Does your memory have ECC feature? it it does it, make sure is enabled. I would pass a burn-in memory test with memtest86 to see what happens, I bet you have a faulty memory chip, doesn't look like a bug.
Often I see ARM stack traces (read: Android NDK stack traces) that terminate with an lr pointer, like so:
#00 pc 001c6c20 /data/data/com.audia.dev.qt/lib/libQtGui.so
#01 lr 80a356cc /data/data/com.audia.dev.rta/lib/librta.so
I know that lr stands for link register on ARM and other architectures, and that it's a quick way to store a return address, but I don't understand why it always seems to store a useless address. In this example, 80a356cc cannot be mapped to any code using addr2line or gdb.
Is there any way to get more information? Why must the trace stop at the lr address anyway?
Stumbled on the answer finally. I just had to be more observant. Look at the following short stack trace and the information that comes after it:
#00 pc 000099d6 /system/lib/libandroid.so
#01 lr 80b6c17c /data/data/com.audia.dev.rta/lib/librta.so
code around pc:
a9d899b4 bf00bd0e 2102b507 aa016d43 28004798
a9d899c4 9801bfa8 bf00bd0e 460eb573 93004615
a9d899d4 6d842105 462b4632 200047a0 bf00bd7c
a9d899e4 b100b510 f7fe3808 2800edf4 f04fbf14
a9d899f4 200030ff bf00bd10 b097b5f0 4614af01
code around lr:
80b6c15c e51b3078 e5933038 e5932024 e51b302c
80b6c16c e1a00002 e3a01000 e3a02000 ebfeee5c
80b6c17c e1a03000 e50b303c e51b303c e1a03fa3
80b6c18c e6ef3073 e3530000 0a000005 e59f34fc
80b6c19c e08f3003 e1a00003 e51b103c ebfeebe6
Now the lr address is still a 80xxxxxx address that isn't useful to us.
The address it prints from the pc is 000099d6, but look at the next section, code around pc. The first column is a list of addresses (you can tell from the fact that it increments by 16 each time.) None of those addresses looks like the pc address, unless you chop off the first 16 bits. Then you'll notice that the a9d899d4 must correspond to 000099d4, and the code where the program stopped is two bytes in from that.
Android's stack trace seems to have "chopped off" the first 2 bytes of the pc address for me, but for whatever reason it does not do it for addresses in the leaf register. Which brings us to the solution:
In short, I was able to chop off the first 16 bits from the 80b6c17c address to make it 0000c17c, and so far that has given me a valid code address every time that I can look up with gdb or addr2line. (edit: I've found it's actually usually the first 12 bits or first 3 hexadecimal digits. You can decide this for yourself by looking at the stack trace output like I described above.) I can confirm that it is the right code address as well. This has definitely made debugging a great bit easier!
Do you have all debugging info (-g3) on?
Gcc likes to use the lr as a normal register. Remember that a non-leaf function looks like
push {lr}
; .. setup args here etc.
bl foo ; call a function foo
; .. work with function results
pop {pc}
Once it pushed lr to the stack, the compiler can use it almost freely - lr will be overwritten only by function calls. So its quite likely that there is any intermediate value in lr.
This should be stated in the debugging information that the compiler generates, in order to let the debugger know it has to look at the stack value instead of lr.
We are on HPUX and my code is in C++.
We are getting
BUS_ADRALN - Invalid address alignment
in our executable on a function call. What does this error means?
Same function is working many times then suddenly its giving core dump.
in GDB when I try to print the object values it says not in context.
Any clue where to check?
You are having a data alignment problem. This is likely caused by trying to read or write through a bad pointer of some kind.
A data alignment problem is when the address a pointer is pointing at isn't 'aligned' properly. For example, some architectures (the old Cray 2 for example) require that any attempt to read anything other than a single character from memory only occur through a pointer in which the last 3 bits of the pointer's value are 0. If any of the last 3 bits are 1, the hardware will generate an alignment fault which will result in the kind of problem you're seeing.
Most architectures are not nearly so strict, and frequently the required alignment depends on the exact type being accessed. For example, a 32 bit integer might require only the last 2 bits of the pointer to be 0, but a 64 bit float might require the last 3 bits to be 0.
Alignment problems are usually caused by the same kinds of problems that would cause a SEGFAULT or segmentation fault. Usually a pointer that isn't initialized. But it could be caused by a bad memory allocator that isn't returning pointers with the proper alignment, or by the result of pointer arithmetic on the pointer when it isn't of the correct type.
The system implementation of malloc and/or operator new are almost certainly correct or your program would be crashing way before it currently does. So I think the bad memory allocator is the least likely tree to go barking up. I would check first for an uninitialized pointer and then bad pointer arithmetic.
As a side note, the x86 and x86_64 architectures don't have any alignment requirements. But, because of how cache lines work, and for various other reasons, it's often a good idea for performance to align your data on a boundary that's as big as the datatype being stored (i.e. a 4 byte boundary for a 32 bit int).
Most processors (not x86 and friends.. the blacksheep of the family lol) require accesses to certain elements to be aligned on multiples of bytes. I.e. if you read an integer from address 0x04 that is okay, but if you try to do the same from 0x03 you will cause an interrupt to be thrown.
This is because it's easier to implement the load/store hardware if it's always on a multiple of the data size with which you're working.
Since HP-UX runs only on RISC processors, which typically have such constraints, you should see here -> http://en.wikipedia.org/wiki/Data_structure_alignment#RISC.
Actually HP-UX has its own great forum on ITRC and some HP staff members are very helpful. I just took a look at the same topic you are asking and here are some results. For example the similar problem was caused actually by a bad input parameter. I strongly advise you first to read answers to similar question and if necessary to post your question there.
By the way it is likely that you will be asked to post results of these gdb commands:
(gdb) bt
(gdb) info reg
(gdb) disas $pc-16*8 $pc+16*4
Most of these issues are caused by multiple upstream dependencies linking to different versions of the same library.
For example, both the gnustl and stlport provide distinct implementations of the C++ standard library. If you compile and link against gnustl, while one of your dependencies was compiled and linked against stlport, then you will each have a different implementation of standard functions and classes. When your program is launched, the dynamic linker will attempt to resolve all of the exported symbols, and will discover known symbols at incorrect offsets, resulting in the BUS_ADRALN signal.