Recently I ran into a crash while the following statement was being executed:
static const float kDefaultTolerance = DoubleToFloat(0.25);
where DoubleToFloat is defined as follows:
static inline float DoubleToFloat(double x){
return static_cast<float>(x);
}
The log shows the following:
09-04 01:08:50.727 882 882 F DEBUG : signal 4 (SIGILL), code 2 (ILL_ILLOPN), fault addr 0x7f9e3ca96818
From what I have read, SIGILL is raised when a process tries to execute an invalid instruction. So I suspect the compiler (clang in my case) is generating bad code for the snippet above. How can I check what the compiler generates and see what goes wrong in this particular case? Also, are there any tools for debugging this kind of issue?
I had a similar problem today. In the end, the cause was that AVX instructions were being used for floating-point operations on a computer that does not support the AVX instruction set. You can try using the SSE instruction set instead.
I am a C++ beginner. I found a strange phenomenon: GDB does not report the line number of the root cause of the error in this code.
#include <array>
using std::array;
int main(int argc, char **argv) {
array<double, 3> edgePoint1{0, 0, 0};
array<double, 3> edgePoint2{0, 0, 0};
array<double, 3> edgePoint3{0, 0, 0};
array<array<double, 3>, 3> edgePoints{};
edgePoints[0] = edgePoint1;
edgePoints[1] = edgePoint2;
edgePoints[3] = edgePoint3;
return 0;
}
Line 13 is the root of the problem. But when I use 'bt' in GDB, it prints line 15. Why?
Program received signal SIGABRT, Aborted.
0x00007f51f3133d7f in raise () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007f51f3133d7f in raise () from /usr/lib/libc.so.6
#1 0x00007f51f311e672 in abort () from /usr/lib/libc.so.6
#2 0x00007f51f3176878 in __libc_message () from /usr/lib/libc.so.6
#3 0x00007f51f3209415 in __fortify_fail_abort () from /usr/lib/libc.so.6
#4 0x00007f51f32093c6 in __stack_chk_fail () from /usr/lib/libc.so.6
#5 0x0000556e72f282b1 in main (argc=1, argv=0x7ffdc9299218) at /home/wzx/CLionProjects/work/test1.cpp:15
#6 0x0000000000000000 in ?? ()
The debugger diagnoses practical errors: things that happen as a result of the mistakes/bugs in your code, after the very complicated process of translating source code into an actual program that the computer can run. It does not analyse C++ source for the mistakes/bugs in your code, nor is it theoretically capable of doing so (at least not in the general case). Here the practical error is that your buffer overrun corrupted the "stack". You only see that symptom reported, not the original cause (the buffer overrun itself).
It's a little like how if you accidentally steer your car off the road and smash it into a tree, the police know that you smashed your car into the tree, but they don't automatically know that this is because you had a stroke at the wheel, or because you were texting, or because you were drunk. They have to investigate to find out these details after the fact, using other (more indirect) pieces of evidence, such as interviewing you or performing a medical examination.
(Notice that the phone flew through the broken window and landed on the ground near the tree: it's nowhere near the driver's hand — even though the cause of the crash was that it was in the driver's hand. A good policeman will realise that the phone probably used to be inside the car, and based on the half-written text message displayed on its screen it was probably in the driver's hand at the time of the crash. Case closed, your honour. Solution: stop texting while driving.)
This is a fact of life with C++, which is why we need to pay careful attention to our code when writing it, so that we don't "shoot ourselves in the foot". Here you were very fortunate to get a crash at all, otherwise you may have missed the bug entirely and instead just seen unexpected/weird behaviours!
Over time, as you gain experience, you will become more accustomed to this, and get skilled at looking "around" or "near" the reported line to see what logical error led to the practical problem. It's mental pattern matching, for the most part. It's also kind of why you can't "learn C++ in 21 days"!
Some tools do exist to make this easier. Static Analysis tools can look at your code and sometimes spot when you've used an impossible array index. Containers (e.g. array and vector) can be implemented with additional bounds checking (for at() this is required; for op[] some implementations add it for convenience in debug mode). Combine tooling with experience for great success!
What Lightness Races in Orbit said is true, but it is also true that when you compile with debug info (i.e. the -g option with gcc/clang), the compiler emits line information that in theory allows a debugger to associate each machine instruction with a source line number, even when compiling with -O3, where really fancy optimizations occur.
That said, the explanation of why gdb tells you the program crashed on line 15 is simple: the crash really did not happen on line 13. It's enough to look at the stack backtrace [I compiled your program and ran it under gdb on Linux]:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007ffff7a24801 in __GI_abort () at abort.c:79
#2 0x00007ffff7a6d897 in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7ffff7b9a988 "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:181
#3 0x00007ffff7b18cd1 in __GI___fortify_fail_abort (need_backtrace=need_backtrace@entry=false,
msg=msg@entry=0x7ffff7b9a966 "stack smashing detected") at fortify_fail.c:33
#4 0x00007ffff7b18c92 in __stack_chk_fail () at stack_chk_fail.c:29
#5 0x00005555555547e2 in main (argc=1, argv=0x7fffffffdde8) at crash.cpp:15
As you can see in frame #4, your program did not crash because of the buffer overflow itself but because of the compiler's stack protector (the function __stack_chk_fail).
Since that is not code written by you, but automatically emitted by the compiler precisely in order to detect such bugs, the line information cannot be real. The compiler just used line 15, because it is where your main() function ends and, of course, the place where, if you look at the disassembly code, the compiler emitted code using stack sentinels to detect stack corruption.
In order to see the whole picture even better, here is the disassembly code (just use disass /s main in gdb to see it):
13 edgePoints[3] = edgePoint3;
0x000000000000079e <+308>: lea rax,[rbp-0x50]
0x00000000000007a2 <+312>: mov esi,0x3
0x00000000000007a7 <+317>: mov rdi,rax
0x00000000000007aa <+320>: call 0x7e4 <std::array<std::array<double, 3ul>, 3ul>::operator[](unsigned long)>
0x00000000000007af <+325>: mov rcx,rax
0x00000000000007b2 <+328>: mov rax,QWORD PTR [rbp-0x70]
0x00000000000007b6 <+332>: mov rdx,QWORD PTR [rbp-0x68]
0x00000000000007ba <+336>: mov QWORD PTR [rcx],rax
0x00000000000007bd <+339>: mov QWORD PTR [rcx+0x8],rdx
0x00000000000007c1 <+343>: mov rax,QWORD PTR [rbp-0x60]
0x00000000000007c5 <+347>: mov QWORD PTR [rcx+0x10],rax
14 return 0;
0x00000000000007c9 <+351>: mov eax,0x0
15 }
0x00000000000007ce <+356>: mov rdx,QWORD PTR [rbp-0x8]
0x00000000000007d2 <+360>: xor rdx,QWORD PTR fs:0x28
0x00000000000007db <+369>: je 0x7e2 <main(int, char**)+376>
0x00000000000007dd <+371>: call 0x540 <__stack_chk_fail@plt>
0x00000000000007e2 <+376>: leave
As you can see, there are several instructions at line 15, clearly emitted by the compiler because the stack protector is enabled by default.
If you compile your program with -fno-stack-protector, it won't crash [at least it doesn't on my machine with my compiler], but the actual stack corruption will still be there, just producing unpredictable effects. In a bigger program, when stack corruption occurs, you could expect any kind of weird behavior much later than the moment when the corruption occurred. In other words, the stack protector is a very good thing: it helps you by exposing the problem instead of hiding it, which is what would naturally happen without it.
The issue is in line:
edgePoints[3] = edgePoint3;
Probably a typo.
gdb cannot stop earlier, because the store on that line is an ordinary, valid instruction; only the index is wrong. The failure happens later, when code that checks the state of the stack runs (this depends on your compiler flags, which you didn't give).
At this stage, the undefined behavior (the line mentioned above) has already wreaked havoc and anything can happen. Thanks to the checks, at least you can see that you have a stack problem.
On some compilers you can enable additional checks for out-of-bounds access, or use AddressSanitizer. Either would have flagged the error without relying on gdb.
My long-running application crashes randomly with a segmentation fault. When trying to debug the generated core dump, I get stuck with a weird stack trace:
(gdb) bt full
#0 __memmove_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:2582
No locals.
#1 0x00000000 in ?? ()
No symbol table info available.
How can it happen that the backtrace starts at 0x00000000?
What can I do to debug this issue further? I can't run it in gdb, as it may take as long as a week until the crash occurs.
Generally this means that the return address on the stack has been overwritten with 0, probably by overrunning the end of an on-stack array. You can try building with AddressSanitizer on gcc or clang (if you are using them), or running under valgrind to see whether it reports invalid memory writes.
I am working on a large Fortran code. Before compiling with fast options (in order to run tests on a large database), I usually compile with warning options in order to detect and backtrace all the problems.
So, compiling with gfortran -fbacktrace -ffpe-trap=invalid,zero,overflow,underflow -Wall -fcheck=all -ftrapv -g2, I get the following error:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x7fec64cdfef7 in ???
#1 0x7fec64cdf12d in ???
#2 0x7fec6440e4af in ???
#3 0x7fec64a200b4 in ???
#4 0x7fec649dc5ce in ???
#5 0x4cf93a in __f_mod_MOD
at /f_mod.f90:132
#6 0x407d55 in main_loop_
at main.f90:419
#7 0x40cf5c in main_prog
at main.f90:180
#8 0x40d5d3 in main
at main.f90:68
The code at f_mod.f90:132 contains a where construct:
! Compute s parameter
do i = 1, Imax
where (dprim .ne. 1.0)
s(:,:,:, :) = s(:,:,:, :) +vprim(:,:,:, i,:)*dprim(:,:,:, :)*dprim(:,:,:, :)/(1.0 -dprim(:,:,:, :))
endwhere
enddo
But I do not see any mistake here. All the other locations are the calls of the subroutines leading to this part. And of course, since it is a SIGFPE error, I have no problem at run time when I compile with gfortran -g1. (I use gfortran 6.4.0 on Linux.)
Moreover, this error appears and disappears when I modify completely different parts of the code. So does the problem come from this where construct? Or from somewhere else, with a wrong backtrace? If so, how can I find the mistake?
EDIT:
Since I cannot reproduce this error in a minimal example (they all work), I think the problem comes from somewhere else. But how do I find it in a large code?
As the code is dying with a SIGFPE, use each of the individual
possible traps to learn if it is a FE_DIVBYZERO, FE_INVALID,
FE_OVERFLOW, or FE_UNDERFLOW. If it is an underflow, change
your mask to '1 - dprim .ne. 0'.
PS: Don't use array section notation when a whole array reference
can be used instead.
PPS: You may want to compute dprim*dprim / (1 - dprim) outside
of the do-loop as it is loop invariant.
I'm on an ARM Cortex M0 (Nordic NRF51822) using the Segger JLink. When my code hard faults (say, due to dereferencing an invalid pointer), I see only the following stack trace:
(gdb) bt
#0 HardFault_HandlerC (hardfault_args=<optimized out>) at main_display.cpp:440
#1 0x00011290 in ?? ()
I have a hard fault handler installed and it can give me the lr and pc:
(gdb) p/x stacked_pc
$1 = 0x18ea6
(gdb) p/x stacked_lr
$2 = 0x18b35
And I know I can use addr2line to translate these to source code lines:
> arm-none-eabi-addr2line -e main_display.elf 0x18ea6
/Users/cmason/code/nrf/src/../libs/epaper/EPD_Display.cpp:33
> arm-none-eabi-addr2line -e main_display.elf 0x18b35
/Users/cmason/code/nrf/src/../libs/epaper/EPD.cpp:414
Can I get the rest of the backtrace somehow? If I stop at a normal breakpoint I can get a backtrace, so I know GDB can do the (somewhat complex) algorithm to unwind the stack on ARM. I understand that, in the general case, the stack may be screwed up by my code to the point where it's unreadable, but I don't think that's what's happening in this case.
I think this may be complicated by Nordic's memory protection scheme. Their Bluetooth stack installs its own interrupt vector and prevents access to certain memory regions. Or maybe this is Segger's fault? On other examples of Cortex M0, do most people see regular backtraces from hard faults?
Thanks!
-c
Cortex-M0 and Cortex-M3 are close enough that you can use the answer from this question:
Stack Backtrace for ARM core using GCC compiler (when there is a MSP to PSP switch)
In short: GCC has a function _Unwind_Backtrace to generate a full call stack; this needs to be hacked up a bit to simulate doing a backtrace from before the exception entry happened. Details in the linked question.