Update Sept. 12, 2011
I was able to get the core file and immediately disassembled the instruction that crashed. As per the advice, I tracked the value of r28 (by the way, no register values were logged to hs_err_pid*.log) and checked where the value came from (see below, marked with <---). However, I was not able to determine the value of r32.
Could the reason for the misalignment be that r28 points to an 8-byte integer that is being loaded into the 4-byte integer r31?
;;; 1053 if( Transfer( len ) == FALSE ) {
0xc00000000c0c55c0:2 <TFM::PrintTrace(..)+0x32>: adds r44=0x480,r32;; <---
0xc00000000c0c55d0:0 <TFM::PrintTrace(..)+0x40>: ld8 r43=[ret2]
0xc00000000c0c55d0:1 <TFM::PrintTrace(..)+0x41>: (p6) st4 [r35]=ret3
0xc00000000c0c55d0:2 <TFM::PrintTrace(..)+0x42>: adds r48=28,r33
0xc00000000c0c55e0:0 <TFM::PrintTrace(..)+0x50>: mov ret0=0;;
0xc00000000c0c55e0:1 <TFM::PrintTrace(..)+0x51>: ld8.c.clr r62=[r45]
0xc00000000c0c55e0:2 <TFM::PrintTrace(..)+0x52>: cmp.eq.unc p6,p1=r0,r62
;;; 1056 throw MutexLock ;
0xc00000000c0c55f0:0 <TFM::PrintTrace(..)+0x60>: nop.m 0x0
0xc00000000c0c55f0:1 <TFM::PrintTrace(..)+0x61>: nop.m 0x0
0xc00000000c0c55f0:2 <TFM::PrintTrace(..)+0x62>: (p6) br.cond.dpnt.many _NZ10TFM07PrintTraceEPi+0x800;;
;;; 1057 }
0xc00000000c0c5600:0 <TFM::PrintTrace(..)+0x70>: adds r41=0x488,r32
0xc00000000c0c5600:1 <TFM::PrintTrace(..)+0x71>: adds r40=0x490,r32
0xc00000000c0c5600:2 <TFM::PrintTrace(..)+0x72>: br.call.dptk.many rp=0xc00000000c080620;;
;;; 1060 dwDataLen = len ;
0xc00000000c0c5610:0 <TFM::PrintTrace(..)+0x80>: ld8 r16=[r44] <---
0xc00000000c0c5610:1 <TFM::PrintTrace(..)+0x81>: mov gp=r36
0xc00000000c0c5610:2 <TFM::PrintTrace(..)+0x82>: (p1) mov r62=8;;
0xc00000000c0c5620:0 <TFM::PrintTrace(..)+0x90>: cmp.eq.unc p6=r0,r16
0xc00000000c0c5620:1 <TFM::PrintTrace(..)+0x91>: nop.m 0x0
0xc00000000c0c5620:2 <TFM::PrintTrace(..)+0x92>: (p6) br.cond.dpnt.many _NZ10TFM07PrintTraceEPi+0xda0;;
0xc00000000c0c5630:0 <TFM::PrintTrace(..)+0xa0>: adds r21=16,r16 <---
0xc00000000c0c5630:1 <TFM::PrintTrace(..)+0xa1>: (p1) mov r62=8;;
0xc00000000c0c5630:2 <TFM::PrintTrace(..)+0xa2>: nop.i 0x0
0xc00000000c0c5640:0 <TFM::PrintTrace(..)+0xb0>: ld8 r42=[r21];; <---
0xc00000000c0c5640:1 <TFM::PrintTrace(..)+0xb1>: cmp.eq.unc p6=r0,r42
0xc00000000c0c5640:2 <TFM::PrintTrace(..)+0xb2>: nop.i 0x0
0xc00000000c0c5650:0 <TFM::PrintTrace(..)+0xc0>: nop.m 0x0
0xc00000000c0c5650:1 <TFM::PrintTrace(..)+0xc1>: mov r47=5
0xc00000000c0c5650:2 <TFM::PrintTrace(..)+0xc2>: (p6) br.cond.dpnt.many _NZ10TFM07PrintTraceEPi+0xdf0;;
0xc00000000c0c5660:0 <TFM::PrintTrace(..)+0xd0>: ld4.a r27=[r48]
;;; 1064 if( dwDataLen <= dwViewLen ) {
0xc00000000c0c5660:1 <TFM::PrintTrace(..)+0xd1>: adds r28=28,r42 <--
0xc00000000c0c5660:2 <TFM::PrintTrace(..)+0xd2>: cmp.ne.unc p6=r0,r46;;
0xc00000000c0c5670:0 <TFM::PrintTrace(..)+0xe0>: ld4.sa r26=[r28]
0xc00000000c0c5670:1 <TFM::PrintTrace(..)+0xe1>: (p6) ld4 r31=[r28] <-- instruction that crashed
Let me know if register values are needed. I think I can acquire the register values using the info reg command of gdb.
This is the result of info registers (I excluded the prXXX and brXXX values); I have no idea how to map these to the disassembled instructions above.
gr1: 0x9fffffffbf716588
gr2: 0x9fffffff5f667c00
gr3: 0x9fffffff5f667c00
gr4: 0x6000000000e0b000
gr5: 0x9fffffff8adfe2e0
gr6: 0x9fffffff8ada9000
gr7: 0x9fffffff8ad7a000
gr8: 0x1
gr9: 0x9fffffff8adfd0f0
gr10: 0
gr11: 0xc000000000000690
gr12: 0x9fffffff8adfd140
gr13: 0x6000000001681510
gr14: 0x9fffffffbf7d8e98
gr15: 0x1a
gr16: 0x60000000044dac60
gr17: 0x1f
gr18: 0
gr19: 0x9fffffff8ad023f0
gr20: 0x9fffffff8adfd0e0
gr21: 0x60000000044dac70
gr22: 0x9fffffff5f668000
gr23: 0xd
gr24: 0x1
gr25: 0xc0000000004341f0
gr26: NaT
gr27: 0x63
gr28: 0xc00000000c5f801c
gr29: 0xc00000000029db20
gr30: 0xc00000000029db20
gr31: 0x288
gr32: 0x60000000044796d0
gr33: 0x6000000001a78910
gr34: 0x7e
gr35: 0x6000000001d03a90
gr36: 0x9fffffffbf716588
gr37: 0xc000000000000c9d
gr38: 0xc00000000c0c4f70
gr39: 0x9
gr40: 0x6000000004479b60
gr41: 0x6000000004479b58
gr42: 0xc00000000c5f8000
gr43: 0x9fffffffbf7144e0
gr44: 0x6000000004479b50
gr45: 0x6000000004479b68
gr46: 0x6000000001d03a90
gr47: 0x5
gr48: 0x6000000001a7892c
gr49: 0x9fffffff8adfe110
gr50: 0xc000000000000491
gr51: 0xc00000000c0c5520
gr52: 0xc00000000c07dd10
gr53: 0x9fffffff8adfe120
gr54: 0x9fffffff8adfe0a0
gr55: 0xc00000000000058e
gr56: 0xc00000000042be40
gr57: 0x39
gr58: 0x3
gr59: 0x33
gr60: 0
gr61: 0x9fffffffbf7d2598
gr62: 0x8
gr63: 0x9fffffffbf716588
gr64: 0xc000000000000f22
gr65: 0xc00000000c0c5610
This is an update to my previous post. Since I was furnished a copy of the core file, I used gdb to examine it and executed the following commands:
1) bt
2) frame n <- the frame where the abort occurred
3) disas
And here are the results.
(gdb) bt
#0 0xc0000000001e5350:0 in _lwp_kill+0x30 ()
from /usr/lib/hpux64/libpthread.so.1
#1 0xc00000000014c7b0:0 in pthread_kill+0x9d0 ()
from /usr/lib/hpux64/libpthread.so.1
#2 0xc0000000002e4080:0 in raise+0xe0 () from /usr/lib/hpux64/libc.so.1
#3 0xc0000000003f47f0:0 in abort+0x170 () from /usr/lib/hpux64/libc.so.1
#4 0xc00000000e65e0d0:0 in os::abort ()
at /CLO/Components/JAVA_HOTSPOT/Src/src/os/hp-ux/vm/os_hp-ux.cpp:2033
#5 0xc00000000eb473e0:0 in VMError::report_and_die ()
at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/utilities/vmError.cpp:1008
#6 0xc00000000e66fc90:0 in os::Hpux::JVM_handle_hpux_signal ()
at /CLO/Components/JAVA_HOTSPOT/Src/src/os_cpu/hp-ux_ia64/vm/os_hp-ux_ia64.cpp:1051
#7 <signal handler called>
#8 0xc00000000c0c5670:1 in TFMTrace::PrintTrace () at tfmtrace.cpp:1064
#9 0xc00000000c0c4f70:0 in FMLogger::WriteLog () at fmlogger.cpp:90
...
(gdb) frame 8
#8 0xc00000000c0c5670:1 in TFMTrace::PrintTrace () at tfmtrace.cpp:1064
1064 if( dwDataLen <= dwViewLen ) {
Current language: auto; currently c++
(gdb) disas $pc-16*4 $pc+16*4
...
0xc00000000c0c5660:0 <TFMTrace::PrintTrace(...)+0xd0> : ld4.a r27=[r48] MII,
;;; 1064 if( dwDataLen <= dwViewLen ) {
0xc00000000c0c5660:1 <TFMTrace::PrintTrace(...)+0xd1> : adds r28=28,r42
0xc00000000c0c5660:2 <TFMTrace::PrintTrace(...)+0xd2> : cmp.ne.unc p6=r0,r46;;
0xc00000000c0c5670:0 <TFMTrace::PrintTrace(...)+0xe0> : ld4.sa r26=[r28] MMI,
0xc00000000c0c5670:1 <TFMTrace::PrintTrace(...)+0xe1> : (p6) ld4 r31=[r28]
0xc00000000c0c5670:2 <TFMTrace::PrintTrace(...)+0xe2> : adds r46=24,r42;;
0xc00000000c0c5680:0 <TFMTrace::PrintTrace(...)+0xf0> : (p6) st4 [r35]=r31 MI,I
0xc00000000c0c5680:1 <TFMTrace::PrintTrace(...)+0xf1> : adds r59=36,r42;;
0xc00000000c0c5680:2 <TFMTrace::PrintTrace(...)+0xf2> : nop.i 0x0
0xc00000000c0c5690:0 <TFMTrace::PrintTrace(...)+0x100>: ld4.c.clr r27=[r48] MIB,
;;; 1066 dwLen = dwTrcLen ;
0xc00000000c0c5690:1 <TFMTrace::PrintTrace(...)+0x101>: cmp4.eq.unc p6,p8=99,r27
0xc00000000c0c5690:2 <TFMTrace::PrintTrace(...)+0x102>: nop.b 0x0;;
0xc00000000c0c56a0:0 <TFMTrace::PrintTrace(...)+0x110>: (p8) ld4.c.clr r26=[r28] MMI
;;; 1067 }
0xc00000000c0c56a0:1 <TFMTrace::PrintTrace(...)+0x111>: (p6) st4 [r48]=r47
0xc00000000c0c56a0:2 <TFMTrace::PrintTrace(...)+0x112>: cmp4.geu.unc p7=r26,r27
End of assembler dump.
A "normal" crash in native code causes a report like this:
C [libc.so.6+0x88368] strstr+0x64a
Note small offset from the function (strstr in this case) to the crash point.
In your case, the JVM decided that the address 0xc00000000f675671 is inside libtracejni.so, but the closest function it could find is very far from the crash point (0x5065eff9 == 1.2 GB away).
Is your library really that big?
If it really is that big, chances are you have stripped it, and so the symbol _NZ10TFM07PrintTraceEPi doesn't actually have anything to do with the problem (which is in the code that is 1.2GB away).
You need to find out what code was really at address 0xc00000000f675671 at the time of the crash. Usually hs_err_pid*.log contains a list of load addresses for all the shared libraries. Find the load address of libtracejni.so and subtract it from the pc. That should give you an address similar to 0x400...675671, which you should be able to look up in your unstripped version of libtracejni.so.
Also note that crash address ends with ASCII "C8G", which may or may not be a coincidence.
Update 2011/08/05.
Now you know which instruction crashed:
0x4000000000099670:1 <TFMTrace::PrintTrace(...)+0xe1>: (p6) ld4 r31=[r28]
This is a load of a 4-byte integer from the memory pointed to by r28.
The next questions are: what is the value of r28 at the crash point (it should be logged in hs_err*.log), and where did it come from (a complete disassembly of TFM::PrintTrace will tell you that)?
Related
I came across the code below for walking a backtrace:
struct stack_frame {
    struct stack_frame *prev;
    void *return_addr;
} __attribute__((packed));
typedef struct stack_frame stack_frame;

__attribute__((noinline, noclone))
void backtrace_from_fp(void **buf, int size)
{
    int i;
    stack_frame *fp;

    __asm__("movl %%ebp, %[fp]" : /* output */ [fp] "=r" (fp));

    for (i = 0; i < size && fp != NULL; fp = fp->prev, i++)
        buf[i] = fp->return_addr;
}
The reason for looking for this code is that we are using a third-party malloc hook, and therefore don't want to use backtrace(), which itself allocates memory. The above doesn't work for x86_64, so I modified the asm statement to
__asm__("movl %%rbp, %[fp]" : /* output */ [fp] "=r" (fp));
and I get a crash:
(gdb) bt
#0 backtrace_from_fp (size=10, buf=<optimized out>) at src/tcmalloc.cc:1910
#1 tc_malloc (size=<optimized out>) at src/tcmalloc.cc:1920
#2 0x00007f5023ade58d in __fopen_internal () from /lib64/libc.so.6
#3 0x00007f501e687956 in selinuxfs_exists () from /lib64/libselinux.so.1
#4 0x00007f501e67fc28 in init_lib () from /lib64/libselinux.so.1
#5 0x00007f5029a32503 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#6 0x00007f5029a241aa in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#7 0x0000000000000001 in ?? ()
#8 0x00007fff22cb8e24 in ?? ()
#9 0x0000000000000000 in ?? ()
(gdb)
(gdb) p $rbp
$2 = (void *) 0x7f501e695f37
(gdb) p (stack_frame *)$rbp
$3 = (stack_frame *) 0x7f501e695f37
(gdb) p *$3
$4 = {prev = 0x69662f636f72702f, return_addr = 0x6d6574737973656c}
(gdb) x /1xw 0x69662f636f72702f
0x69662f636f72702f: Cannot access memory at address 0x69662f636f72702f
(gdb) fr
#0 backtrace_from_fp (size=10, buf=<optimized out>) at src/tcmalloc.cc:1910
1910 in src/tcmalloc.cc
(gdb)
Am I missing something? Any help on how I can reconstruct the same via code?
Am I missing something?
The code you referenced assumes the compiled code maintains a frame-pointer register chain.
This was the default on (32-bit) i*86 up until about 5-7 years ago, and has not been the default on x86_64 since ~forever.
The code will most likely work fine in non-optimized builds, but will fail miserably with optimization on both 32-bit and 64-bit x86 platforms using non-ancient versions of the compiler.
If you can rebuild all code (including libc) with -fno-omit-frame-pointer, then this code will work most of the time (but not all the time, because libc may have hand-coded assembly, and that assembly will not have frame pointer chain).
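For illustration, here is a minimal sketch of that frame-pointer walk without the 32-bit-only inline asm; it assumes everything on the stack was built with -fno-omit-frame-pointer and uses GCC's __builtin_frame_address instead of reading %ebp/%rbp by hand (my sketch, not a drop-in fix for the omitted-frame-pointer problem):

struct stack_frame {
    struct stack_frame *prev;   /* saved frame pointer of the caller */
    void *return_addr;          /* return address pushed by the call */
};

__attribute__((noinline))
void backtrace_from_fp(void **buf, int size)
{
    /* With frame pointers kept, the frame base points at the saved
       %ebp/%rbp slot, so frames can be walked like a linked list on
       both i386 and x86_64. */
    struct stack_frame *fp =
        (struct stack_frame *)__builtin_frame_address(0);
    int i;

    for (i = 0; i < size && fp != NULL; fp = fp->prev, i++)
        buf[i] = fp->return_addr;
}

It is still wise to sanity-check fp against the current thread's stack bounds before dereferencing it, since any frame compiled without a frame pointer will derail the walk.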
One solution is to use libunwind. Unfortunately, using it from inside malloc can still run into a problem, if you (or any libraries you use) also use dlopen.
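For completeness, a hedged sketch of the libunwind route (local unwinding only; the function name backtrace_unw is mine, and you link against libunwind):

#define UNW_LOCAL_ONLY
#include <libunwind.h>

/* Sketch: collect up to `size` return addresses of the calling thread. */
static int backtrace_unw(void **buf, int size)
{
    unw_context_t ctx;
    unw_cursor_t cursor;
    int i = 0;

    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);

    while (i < size && unw_step(&cursor) > 0) {
        unw_word_t ip;
        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        buf[i++] = (void *)ip;
    }
    return i;
}

The caveat above still applies: if the unwinder itself ends up calling into malloc or the dynamic loader while you are inside your malloc hook, you are back to the re-entrancy problem.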
I am trying to debug a crash (specifically SIGSEGV) of an application that uses dynamic libraries (.dll). As far as I can see, it crashes in a statically linked part of the code during library initialization. GDB shows something similar to this (the addresses change between runs):
Program received signal SIGSEGV, Segmentation fault.
0x00e41290 in __gcc_register_frame ()
from library.dll
1: x/5i $pc
=> 0xe41290 <__gcc_register_frame+208>: movl $0xebc6f0,0x11f2000
0xe4129a <__gcc_register_frame+218>: mov $0xebc2b0,%esi
0xe4129f <__gcc_register_frame+223>: jmp 0xe41222 <__gcc_register_frame+98>
0xe412a1 <__gcc_register_frame+225>: jmp 0xe412b0 <__gcc_deregister_frame>
During the same session I can see that 0x11f2000 is a valid part of library.dll:
(gdb) info symbol 0x11f2000
deregister_frame_fn in section .data of library.dll
The mentioned address range is accessible from GDB, so setting and reading values is possible:
(gdb) print (unsigned char*)0x11f2000
$1 = (unsigned char *) 0x11f2000 <deregister_frame_fn> ""
(gdb) set *(unsigned char*)0x11f2000=1
(gdb) set *(unsigned char*)0x11f2001=2
(gdb) set *(unsigned char*)0x11f2002=3
(gdb) set *(unsigned char*)0x11f2003=4
(gdb) print (unsigned char*)0x11f2000
$2 = (unsigned char *) 0x11f2000 <deregister_frame_fn> "\001\002\003\004"
I am puzzled that I'm not able to make a short C++ example that reproduces this, since another library we use in the same application (built with the same compiler) works fine.
The main question is: why can there be a crash like SIGSEGV if the memory location is accessible in GDB?
gcc version 5.2.0 (Sourcery CodeBench)
Below is the backtrace:
(gdb) bt
#0 0x00e41290 in __gcc_register_frame ()
from library.dll
#1 0x00eb7afa in __do_global_ctors ()
from library.dll
#2 0x00e4110b in DllMainCRTStartup#12 ()
from library.dll
...
I need some advice on how to identify the source of the segfault.
compiled with ASAN:
==21093==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f09d744d882 bp 0x000000001000 sp 0x62100001c538 T0)
ASAN:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.
started with gdb:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5eeb882 in __memset_avx2_erms () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff5eeb882 in __memset_avx2_erms () from /usr/lib/libc.so.6
#1 0xbebebebebebebebe in ?? ()
#2 0xbebebebebebebebe in ?? ()
...
1. Edit:
The output above was compiled for 64-bit (x86_64); on 32-bit the following output is generated:
==8361==ERROR: AddressSanitizer failed to allocate 0x200000 (2097152) bytes of SizeClassAllocator32 (error code: 12)
==8361==Process memory map follows:
0x00200000-0x00300000
0x00400000-0x00500000
...
0xf7791000-0xf7792000 /lib32/ld-2.24.so
0xf7800000-0xffd00000
0xffe34000-0xffe55000 [stack]
==8361==End of process memory map.
==8361==AddressSanitizer CHECK failed: ../../../../../src/libsanitizer/sanitizer_common/sanitizer_common.cc:180 "((0 && "unable to mmap")) != (0)" (0x0, 0x0)
ERROR: Failed to mmap
2. Edit:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff5eeb882 in __memset_avx2_erms () from /usr/lib/libc.so.6
(gdb) bt
#0 0x00007ffff5eeb882 in __memset_avx2_erms () from /usr/lib/libc.so.6
#1 0xbebebebebebebebe in ?? ()
#2 0xbebebebebebebebe in ?? ()
#3 0xbebebebebebebebe in ?? ()
#4 0xbebebebebebebebe in ?? ()
...
(gdb) record instruction-history
17798 0x00007ffff5eeb8b6 <__memset_avx2_unaligned_erms+22>: cmp $0x40,%rdx
17799 0x00007ffff5eeb8ba <__memset_avx2_unaligned_erms+26>: ja 0x7ffff5eeb8ca <__memset_avx2_unaligned_erms+42>
17800 0x00007ffff5eeb8ca <__memset_avx2_unaligned_erms+42>: cmp $0x800,%rdx
17801 0x00007ffff5eeb8d1 <__memset_avx2_unaligned_erms+49>: ja 0x7ffff5eeb870 <__memset_avx2_erms>
17802 0x00007ffff5eeb870 <__memset_avx2_erms+0>: vzeroupper
17803 0x00007ffff5eeb873 <__memset_avx2_erms+3>: mov %rdx,%rcx
17804 0x00007ffff5eeb876 <__memset_avx2_erms+6>: movzbl %sil,%eax
17805 0x00007ffff5eeb87a <__memset_avx2_erms+10>: mov %rdi,%rdx
17806 0x00007ffff5eeb87d <__memset_avx2_erms+13>: rep stos %al,%es:(%rdi)
17807 0x00007ffff5eeb87f <__memset_avx2_erms+15>: mov %rdx,%rax
I'm not sure what that means or why this causes a segfault.
I need some advice on how to identify the source of the segfault.
The GDB stack trace is typical of a stack buffer overflow, similar to:
#include <string.h>

int main()
{
    char buf[1];
    memset(buf, 0xbe, 1 << 20);  /* writes 1 MiB of 0xbe into a 1-byte stack buffer */
}
Note that the bogus frame entries are 0xbebebebebebebebe, exactly the memset fill byte repeated: the overflow has overwritten the saved frame pointers and return addresses on the stack. It is surprising that AddressSanitizer didn't catch that overflow.
I would try to debug it with the GDB branch trace support, as described here.
P.S. If you can construct a minimal example, Address Sanitizer developers will be interested in it.
Is it being built and run on different machines/environments?
I observe such segfaults for an executable compiled with ASan when it is built and run on different environments/machines (I don't observe them if the library versions are the same); without ASan the app runs fine on the other machine.
In my case, when I run an app with AddressSanitizer on a different machine:
./dummy_logger
ASAN:SIGSEGV
=================================================================
==18213==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 (pc 0xf7f45e60 bp 0x1ffff000 sp 0xffab0a4c T16777215)
#0 0xf7f45e5f in _dl_get_tls_static_info (/lib/ld-linux.so.2+0x11e5f)
#1 0xf7a59d1c (/usr/lib/i386-linux-gnu/libasan.so.2+0xacd1c)
#2 0xf7a4ddbd (/usr/lib/i386-linux-gnu/libasan.so.2+0xa0dbd)
#3 0xf7f438ea (/lib/ld-linux.so.2+0xf8ea)
#4 0xf7f34cb9 (/lib/ld-linux.so.2+0xcb9)
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ??:0 _dl_get_tls_static_info
==18213==ABORTING
And it works fine on the machine where it was compiled.
I'm using gdb to debug a program, and what I see is kind of strange:
(gdb) bt
0xb59656f8 in globalCallStubClass::addCallContext (**this=0x0**)
at /ccase_enodeb/callp/build_callp/src/test/framework/shared/src/shared_call_context.cc:1962
0xb5b52e64 in rrcStubClass::process_scenario_spontaneous_trigger_RRC_CONNECTION_REQUEST (gcppMsgCtx=...)
at /ccase_enodeb/callp/build_callp/src/test/framework/rrc/src/rrc_connection_request.cc:90
0xb6c3be4c in Gcpp::routeMessage (this=0xb392e9d0) at /ccase_enodeb/callp/build_callp/src/callp_services/gcpp/src/gcpp.cc:1095
0xb6c3b3b0 in Gcpp::loop (this=0xb392e9d0, Default_Method_Ptr=0)
at /ccase_enodeb/callp/build_callp/src/callp_services/gcpp/src/gcpp.cc:925
0xb58d2ae0 in stubBthdEntryPoint () at /ccase_enodeb/callp/build_callp/src/test/framework/root/src/stub_root.cc:314
0x000191f8 in lxb_thd_entry (pCtx=0x68c0f8) at /vobs/onepltf/ltefdd/core/src/lxbase/lxbase.c:3289
0xb575602e in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
0xb56d6ab8 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
0xb56d6ab8 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) print pCallStub
$1 = (globalCallStubClass *) **0x7a1da8**
(gdb) print this
$2 = (globalCallStubClass * const) **0x0**
The crash appears at the line marked with (-->):
if (pCallStub != NULL) {
-->callStubClass* pCallInst = pCallStub->addCallContext();
}
The function addCallContext is called on the object pCallStub (pCallStub is instantiated and is not NULL). When I print pCallStub I can see that it has an address:
(gdb) print pCallStub
$1 = (globalCallStubClass *) 0x7a1da8
but still, this (which should be pCallStub) is 0x0:
(gdb) print this
$2 = (globalCallStubClass * const) 0x0
Can anyone help me?
Thanks,
Geta
pCallStub is 0x0, so it is a null pointer. You have to instantiate an object with pCallStub = new globalCallStubClass(), or with a creator function like pCallStub = createGlobalCallStubClass(), before using the pointer.
(gdb) print pCallStub
$1 = (globalCallStubClass *) **0x7a1da8**
(gdb) print this
$2 = (globalCallStubClass * const) **0x0**
You need to show more code for us to understand your issue.
There is no context here from which we could see that this == pCallStub.
Also, if you have optimizations turned on, you might not see what you think you're seeing (e.g. the compiler optimized the function call and the stack, so gdb doesn't report the right variable because it searches for it on the stack). Typically, on an x86 system, you'll find this in the ecx register.
Since you have multiple threads, you can have the "multithreaded singleton issue": one thread allocates and stores the singleton instance, but the other threads don't see it yet.
Try using an atomic compare-and-swap to set the singleton instance, for example (see the sketch below).
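A minimal sketch of that suggestion, assuming the globalCallStubClass from the question is default-constructible (the accessor name getCallStub is mine); the pointer is published with an atomic compare-and-swap so other threads see either a fully constructed object or nullptr, never a half-initialized one:

#include <atomic>

static std::atomic<globalCallStubClass*> g_callStub{nullptr};

globalCallStubClass* getCallStub()
{
    globalCallStubClass* p = g_callStub.load(std::memory_order_acquire);
    if (p == nullptr) {
        globalCallStubClass* fresh = new globalCallStubClass();
        globalCallStubClass* expected = nullptr;
        if (g_callStub.compare_exchange_strong(expected, fresh,
                                               std::memory_order_release,
                                               std::memory_order_acquire)) {
            p = fresh;      // we won the race; our object is now the singleton
        } else {
            delete fresh;   // another thread published first; use its object
            p = expected;
        }
    }
    return p;
}

If C++11 is available, a function-local static (a "Meyers singleton") gives the same thread-safety guarantee with less code.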
I have a program on a MIPS multicore system, and I get a backtrace from the core file that is really hard to figure out (at least for me). I suppose that maybe one of the other cores wrote to memory, but not all of the stack is corrupted, which makes it even more confusing for me.
In frame #2 this is NULL, and in frame #0 this is NULL too (the cause of the core dump).
This is (part of) the backtrace:
#0 E::m (this=0x0, string=0x562f148 "", size=202) at E.cc:315
#1 0x00000000105c773c in P::e (this=0x361ecd00, string=0x562f148 "", size=202, offset=28) at P.cc:137
#2 0x00000000105c8c5c in M::e (this=0x0, id=7 '\a', r=2, string=0x562f148 "", size=202, oneClass=0x562f148 "", secondClass=0x14eff439 "",
offset=28) at M.cc:75
#3 0x0000000010596354 in m::find (this=0x4431fd70, string=0x562f148 "", size=202, oneClass=0x14eff438 "", secondClass=0x14eff439 "",
up=false) at A.cc:458
#4 0x0000000010597364 in A::trigger (this=0x4431fd70, triggerType=ONE, string=0x562f148 "", size=0, up=true) at A.cc:2084
#5 0x000000001059bcf0 in A::findOne (this=0x4431fd70, index=2, budget=0x562f148 "", size=202, up=true) at A.cc:1155
#6 0x000000001059c934 in A::shouldpathNow (this=0x4431fd70, index=2, budget=0x562f148 "", size=202, up=false, startAt=0x0, short=<optimized out>)
at A.cc:783
#7 0x00000000105a385c in A::shouldpath (this=0x4431fd70, index=2, rbudget=<optimized out>, rsize=<optimized out>, up=false,
direct=<optimized out>) at A.cc:1104
About the m::find function:
442 m_t m::find(unsigned char const *string, unsigned int size,
443 hClass_t *hClass, h_t *fHClass,
444 bool isUp) {
445
446
447 const Iterator &it=arr_[getIndex()]->getSearchIterator((char const*)value, len);
448
449 unsigned int const offset = value - engine_->getData();
450
451 int ret=UNKNOWN;
452 M *p;
453 for(const void* match=it.next();
454 ret == UNKNOWN && match != NULL;
455 match = it.next()){
456 p = (M*)match;
457 if(p->needMore()){
458 ret = p->e(id_, getIndex(), value, len, hClass, fHClass, offset);
this=0x0 can actually happen pretty easily. For example:
E *instance = NULL;
instance->method();
this will be NULL within method.
There's no need to assume that the memory has been corrupted or the stack has been overwritten. In fact, if the rest of the stack's contents seem to make sense (and you seem to think that they do), then the stack is probably fine.
Instead of necessarily looking for memory corruption, check your logic to see if you have an uninitialized (NULL) pointer or reference.
Not being able to see all the code, it's kind of difficult to imagine what's happening. Could you also add the code for M::e() and P::e(), or at least the important parts?
Something that might just solve everything is to add a NULL check, as follows in m::find():
456 p = (M*)match;
if(!p) { return; /* or do whatever */ }
457 if(p->needMore()){
458 ret = p->e(id_, getIndex(), value, len, hClass, fHClass, offset);
If p were NULL, I would have expected it to have crashed calling p->needMore(), but depending on what that method does, it may not crash.
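To illustrate that last point (hypothetical class, not the poster's real M or E): calling a non-virtual member function through a null pointer is undefined behaviour, but in practice it often does not fault until the function actually reads a data member.

struct Demo {
    int more_;
    bool needMore() const { return true; }   // never touches *this: often "survives" this == nullptr
    int  value()    const { return more_; }  // reads a member: faults when this == nullptr
};

int main()
{
    Demo *p = 0;
    p->needMore();      // undefined behaviour, but typically does not crash
    return p->value();  // undefined behaviour, typically crashes here
}

A virtual call through a null pointer, by contrast, usually faults immediately, because the vtable pointer has to be loaded from the (null) object.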