How to debug crash, when backtrace starts with zero - c++

my long running application crashes randomly with segmentation fault. When trying to debug the generated coredump, I get stuck with wierd stacktrace:
(gdb) bt full
#0 __memmove_ssse3 () at ../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:2582
No locals.
#1 0x00000000 in ?? ()
No symbol table info available.
How it can happen, that the backtrace starts at 0x00000000?
What can I do to debug this issue more? I can't run it in gdb as it may take even a week till the crash occures.

Generally this means that the return address on the stack has been overwritten with 0, probably due to overrunning the end of an on-stack array. You can trying building with address sanitizer on gcc or clang (if you are using them). Or you can try running with valgrind to see if it will tell you about invalid memory writes.

Related

core dump on malloc_consolidate () from /lib64/libc.so.6 [duplicate]

I usually love good explained questions and answers. But in this case I really can't give any more clues.
The question is: why malloc() is giving me SIGSEGV? The debug bellow show the program has no time to test the returned pointer to NULL and exit. The program quits INSIDE MALLOC!
I'm assuming my malloc in glibc is just fine. I have a debian/linux wheezy system, updated, in an old pentium (i386/i486 arch).
To be able to track, I generated a core dump. Lets follow it:
iguana$gdb xadreco core-20131207-150611.dump
Core was generated by `./xadreco'.
Program terminated with signal 11, Segmentation fault.
#0 0xb767fef5 in ?? () from /lib/i386-linux-gnu/libc.so.6
(gdb) bt
#0 0xb767fef5 in ?? () from /lib/i386-linux-gnu/libc.so.6
#1 0xb76824bc in malloc () from /lib/i386-linux-gnu/libc.so.6
#2 0x080529c3 in enche_pmovi (cabeca=0xbfd40de0, pmovi=0x...) at xadreco.c:4519
#3 0x0804b93a in geramov (tabu=..., nmovi=0xbfd411f8) at xadreco.c:1473
#4 0x0804e7b7 in minimax (atual=..., deep=1, alfa=-105000, bet...) at xadreco.c:2778
#5 0x0804e9fa in minimax (atual=..., deep=0, alfa=-105000, bet...) at xadreco.c:2827
#6 0x0804de62 in compjoga (tabu=0xbfd41924) at xadreco.c:2508
#7 0x080490b5 in main (argc=1, argv=0xbfd41b24) at xadreco.c:604
(gdb) frame 2
#2 0x080529c3 in enche_pmovi (cabeca=0xbfd40de0, pmovi=0x ...) at xadreco.c:4519
4519 movimento *paux = (movimento *) malloc (sizeof (movimento));
(gdb) l
4516
4517 void enche_pmovi (movimento **cabeca, movimento **pmovi, int c0, int c1, int c2, int c3, int p, int r, int e, int f, int *nmovi)
4518 {
4519 movimento *paux = (movimento *) malloc (sizeof (movimento));
4520 if (paux == NULL)
4521 exit(1);
Of course I need to look at frame 2, the last on stack related to my code. But the line 4519 gives SIGSEGV! It does not have time to test, on line 4520, if paux==NULL or not.
Here it is "movimento" (abbreviated):
typedef struct smovimento
{
int lance[4]; //move in integer notation
int roque; // etc. ...
struct smovimento *prox;// pointer to next
} movimento;
This program can load a LOT of memory. And I know the memory is in its limits. But I thought malloc would handle better when memory is not available.
Doing a $free -h during execution, I can see memory down to as low as 1MB! Thats ok. The old computer only has 96MB. And 50MB is used by the OS.
I don't know to where start looking. Maybe check available memory BEFORE a malloc call? But that sounds a wast of computer power, as malloc would supposedly do that. sizeof (movimento) is about 48 bytes. If I test before, at least I'll have some confirmation of the bug.
Any ideas, please share. Thanks.
Any crash inside malloc (or free) is an almost sure sign of heap corruption, which can come in many forms:
overflowing or underflowing a heap buffer
freeing something twice
freeing a non-heap pointer
writing to freed block
etc.
These bugs are very hard to catch without tool support, because the crash often comes many thousands of instructions, and possibly many calls to malloc or free later, in code that is often in a completely different part of the program and very far from where the bug is.
The good news is that tools like Valgrind or AddressSanitizer usually point you straight at the problem.

How does `gdb` compute the bounds of a stack frame?

I am debugging a new thread library, in which I set the stack register rsp manually (to switch to a user-managed stack), and then invoke a function which never returns.
When I try to get a backtrace in gdb, I get the following output.
(gdb) bt
#0 load (_m=std::memory_order_seq_cst, this=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>)
at /usr/include/c++/4.9/atomic:209
#1 Arachne::schedulerMainLoop () at Arachne.cc:236
#2 0x000000000040268d in Arachne::threadMainFunction (id=<optimized out>) at Arachne.cc:135
#3 0x0000000000000000 in ?? ()
How does gdb determine that the stack has 0 elements in it?
More generally, how does gdb determine how many elements the stack has?
Asked for position 0 of stack, stack only has 0 elements on it. comes from gdb/dwarf2/expr.c and doesn't refer to your CPU stack. Instead, it complains about the stack that DWARF expressions work on.
This might mean that your DWARF debug info is invalid. Or what you did with esp confused GDB's DWARF expression evaluator.

GDB backtrace from hard fault handler on ARM cortex M0 NRF51822

I'm on an ARM Cortex M0 (Nordic NRF51822) using the Segger JLink. When my code hard faults (say due to a dereferencing an invalid pointer), I see only the following stack trace:
(gdb) bt
#0 HardFault_HandlerC (hardfault_args=<optimized out>) at main_display.cpp:440
#1 0x00011290 in ?? ()
I have a hard fault handler installed and it can give me the lr and pc:
(gdb) p/x stacked_pc
$1 = 0x18ea6
(gdb) p/x stacked_lr
$2 = 0x18b35
And I know I can use addr-to-line to translate these to source code lines:
> arm-none-eabi-addr2line -e main_display.elf 0x18ea6
/Users/cmason/code/nrf/src/../libs/epaper/EPD_Display.cpp:33
> arm-none-eabi-addr2line -e main_display.elf 0x18b35
/Users/cmason/code/nrf/src/../libs/epaper/EPD.cpp:414
Can I get the rest of the backtrace somehow? If I stop at a normal breakpoint I can get a backtrace, so I know GDB can do the (somewhat complex) algorithm to unwind the stack on ARM. I understand that, in the general case, the stack may be screwed up by my code to the point where it's unreadable, but I don't think that's whats happening in this case.
I think this may be complicated by Nordic's memory protection scheme. Their bluetooth stack installs its own interrupt vector and prevents access to certain memory regions. Or maybe this is Segger's fault? On other examples of Cortex M0 do most people see regular back traces from hard faults?
Thanks!
-c
Cortex-M0 and Cortex-M3 is close enough that you can use the answer from this question:
Stack Backtrace for ARM core using GCC compiler (when there is a MSP to PSP switch)
in short: GCC has a function _Unwind_Backtrace to generate a full call stack; this needs to be hacked up a bit to simulate doing a backtrace from before the exception entry happened. Details in the linked question.

Regarding Possible Lost in Valgrind

What is wrong if we push the strings into vector like this:
globalstructures->schema.columnnames.push_back("id");
When i am applied valgrind on my code it is showing
possibly lost of 27 bytes in 1 blocks are possibly lost in loss record 7 of 19.
like that in so many places it is showing possibly lost.....because of this the allocations and frees are not matching....which is resulting in some strange error like
malloc.c:No such file or directory
Although I am using calloc for allocation of memory everywhere in my code i am getting warnings like
Syscall param write(buf) points to uninitialised byte(s)
The code causing that error is
datapage *dataPage=(datapage *)calloc(1,PAGE_SIZE);
writePage(dataPage,dataPageNumber);
int writePage(void *buffer,long pagenumber)
{
int fd;
fd=open(path,O_WRONLY, 0644);
if (fd < 0)
return -1;
lseek(fd,pagenumber*PAGE_SIZE,SEEK_SET);
if(write(fd,buffer,PAGE_SIZE)==-1)
return false;
close(fd);
return true;
}
Exact error which i am getting when i am running through gdb is ...
Breakpoint 1, getInfoFromSysColumns (tid=3, numColumns=#0x7fffffffdf24: 1, typesVector=..., constraintsVector=..., lengthsVector=...,
columnNamesVector=..., offsetsVector=...) at dbheader.cpp:1080
Program received signal SIGSEGV, Segmentation fault.
_int_malloc (av=0x7ffff78bd720, bytes=8) at malloc.c:3498
3498 malloc.c: No such file or directory.
When i run the same through valgrind it's working fine...
Well,
malloc.c:No such file or directory
can occur while you are debugging using gdb and you use command "s" instead of "n" near malloc which essentially means you are trying to step into malloc, the source of which may not be not available on your Linux machine.
That is perhaps the same reason why it is working fine with valgrind.
Why error is in malloc:
The problem is that you overwrote some memory buffer and corrupted one
of the structures used by the memory manager. (c)
Try to run valgrind with --track-origins=yes and see where that uninitialized access comes from. If you believe that it should be initialized and it is not, maybe the data came from a bad pointer, valgrind will show you where exactly the values were created. Probably those uninitialized values overwrote your buffer, including memory manager special bytes.
Also, review all valgrind warnings before the crash.

How to prevent stack corruption?

I'm trying to debug segfault in native app for android.
GDB shows the following:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 5200]
0xbfcc6744 in ?? ()
(gdb) bt
#0 0xbfcc6744 in ?? ()
#1 0x5cfb5458 in WWMath::unProject (x=2.1136094475592566, y=472.2994384765625, z=0, mvpMatrix=#0x0,
viewport=#0x0, result=#0x0) at jni/src/core/util/WWMath.cpp:118
#2 0x00000000 in ?? ()
Is it possible to get a good stack? Or find a place where the stack was corrupted?
UPD:
The function mentioned takes references:
bool WWMath::unProject(double x, double y, double z, const Matrix &mvpMatrix,
const Rect& viewport, Vec4& result)
and reference to simple local variable is passed as the last argument:
Vec4 far, near;
if (!unProject(x, y, 0, tMvp, viewport, near))
We don't have much information to go by! There is no general rule to avoid memory corruption except to be careful with addressing.
But it looks to me like you overflowed an array of floats, because the bogus address 0xbfcc6744 equates to a reasonable float value -1.597 which is in line with the other values reported by GDB.
Overwriting the return address caused execution to jump to that value, so look specifically at the caller of the function WWMath::unProject, whose locals precede its return address, to find the offending buffer. (And now we have it, near.)
Compiling with --fstack-protector-all will cause your program to abort (with signal SIGABRT) when it returns from a function that corrupts the stack, if that corruption includes the area of the stack around the return address.
Stack-protector-all isn't a great debugging tool, but it is easy to try and sometimes does catch problems like this one. Though it will not point you to which line caused the problem, it will at least narrow it down to a single function. Once you have that information, you can step through it in GDB in order to pinpoint the line in question.
Solved this problem only by stepping line-by-line from the beginning of suspicious code and looknig for the moment when the stack gets corrupted.(It was ugly pointer arithmetic with two-dimensional arrays.)
And it seems that there was another way: try to put everything into the heap and hope that incorrect operation will cause segfault.