Need some help ORA-24550: signal received: - c++

This is my first question.If I have missed to add any information. Please let me know.
I am getting below error while executing a batch program.
ORA-24550: signal received: [si_signo=4] [si_errno=0] [si_code=30] [si_addr=0]
kpedbg_dmp_stack()+364<-kpeDbgCrash()+124<-kpeDbgSignalHandler()+680<-skgesig_sigactionHandler()+264<-__sighandler()<-00000000<-00000000
./PRMS_RlsRTS[231]: 53543194 Illegal instruction(coredump).
libdebug assertion "(framep->getGpr(STKP, &addr) == DB_SUCCESS && *nextStkpp == addr)" failed at line 1299 in file ../../../../../../../../../../../src/bos/usr/ccs/lib/libdbx/libdebug/modules/stackdebug/POWER/stackdb_FrameProgress.C
Illegal instruction in . at 0x0 ($t1)
0x00000000 00000000 Invalid opcode.
This is a pro *C program running it on AIX 7.1 with Oracle Client 12.1.

Related

Exit system call causes process to core dump

We were using exit(rc) in our C program. However, it is dumping core.
Attaching to dbx produced below O/P
warning: Unable to access address 0xf07519a4 from core
pthdb_session.c, 546: 1 PTHDB_CALLBACK (callback failed)
pthreaded.c, 1986: PTHDB_CALLBACK (callback failed)
Segmentation fault in Block_pool::~Block_pool() at line 32 in file ""
could not read "Pool.C"
(dbx) where
warning: Unable to access address 0xf07519a4 from core
pthdb_session.c, 546: 1 PTHDB_CALLBACK (callback failed)
pthreaded.c, 1986: PTHDB_CALLBACK (callback failed)
Block_pool::~Block_pool()(0xe9fb0, 0x0), line 32 in "Pool.C"
StringList_Init::~StringList_Init()() at 0x10169f0c
vxIpcChannel._::__srterm() at 0x101690d8
exit(??) at 0xd01835f0
Do we know some reasons why exit system call fails??

What does "Got an error handling event: Dwarf Error: Cannot find type of die" mean?

I'm trying to debug a hang on an old Apple PowerMac G5. The program is C++11, and it was created using MacPorts GCC 5.4. MacPorts also supplies the debugger, which is GDB 6.3.
Under the debugger I see:
...
Catchpoint 1 (exception thrown).
Catchpoint 1 (exception caught), throw location assert.cpp:33, catch location unknown, exception type Botan::Exception
0x009903ec 110 { ::operator delete(__p); }
(gdb) c
Continuing.
Util load/store ran 229 tests all ok
Util round_down ran 6 tests in 0.11 msec all ok
Util round_up ran 11 tests in 22.56 sec all ok
Die: DW_TAG_unspecified_type (abbrev = 3, offset = 247)
has children: FALSE
attributes:
DW_AT_name (DW_FORM_string) string: "decltype(nullptr)"
warning: Got an error handling event: "Dwarf Error: Cannot find type of die [in module /Users/botan/libbotan-1.11.35.35.dylib]".
After the above, things just hang. I can CTRL+C and Continue, but I seem to flop around _Unwind_RaiseException () and _Unwind_GetIPInfo ().
What does Got an error handling event: "Dwarf Error: Cannot find type of die mean?
I've found a couple of related questions, but they are not very satisfying and don't explain much:
Dwarf Error: Cannot find DIE
iceweasel: randomly crashes by receiving from X the following: ABORT: Request 155.34: BadLength
I also can't seem to get a decent backtrace:
(gdb) bt full
#0 0x009903ec in __cxa_throw () at ext/new_allocator.h:110
No symbol table info available.
Die: DW_TAG_unspecified_type (abbrev = 3, offset = 247)
has children: FALSE
attributes:
DW_AT_name (DW_FORM_string) string: "decltype(nullptr)"
Dwarf Error: Cannot find type of die [in module /Users/botan/libbotan-1.11.35.35.dylib]

Debugging a nasty SIGILL crash: Text Segment corruption

Ours is a PowerPC based embedded system running Linux. We are encountering a random SIGILL crash which is seen for wide variety of applications. The root-cause for the crash is zeroing out of the instruction to be executed. This indicates corruption of the text segment residing in memory. As the text segment is loaded read-only, the application cannot corrupt it. So I am suspecting some common sub-system (DMA?) causing this corruption. Since the problem takes days to reproduce (crash due to SIGILL) it is getting difficult to investigate. So to begin with I want to be able to know if and when the text segment of any application has been corrupted.
I have looked at the stack trace and all the pointers, registers are proper.
Do you guys have any suggestions how I can go about it?
Some Info:
Linux 3.12.19-rt30 #1 SMP Fri Mar 11 01:31:24 IST 2016 ppc64 GNU/Linux
(gdb) bt
0 0x10457dc0 in xxx
Disassembly output:
=> 0x10457dc0 <+80>: mr r1,r11
0x10457dc4 <+84>: blr
Instruction expected at address 0x10457dc0: 0x7d615b78
Instruction found after catching SIGILL 0x10457dc0: 0x00000000
(gdb) maintenance info sections
0x10006c60->0x106cecac at 0x00006c60: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
Expected (from the application binary):
(gdb) x /32 0x10457da0
0x10457da0 : 0x913e0000 0x4bff4f5d 0x397f0020 0x800b0004
0x10457db0 : 0x83abfff4 0x83cbfff8 0x7c0803a6 0x83ebfffc
0x10457dc0 : 0x7d615b78 0x4e800020 0x7c7d1b78 0x7fc3f378
0x10457dd0 : 0x4bcd8be5 0x7fa3eb78 0x4857e109 0x9421fff0
Actual (after handling SIGILL and dumping nearby memory locations):
Faulting instruction address: 0x10457dc0
0x10457da0 : 0x913E0000
0x10457db0 : 0x83ABFFF4
=> 0x10457dc0 : 0x00000000
0x10457dd0 : 0x4BCD8BE5
0x10457de0 : 0x93E1000C
Edit:
One lead that we have is that the corruption is always occurring at an offset that ends with 0xdc0.
For e.g.
Faulting instruction address: 0x10653dc0 << printed by our application after catching SIGILL
Faulting instruction address: 0x1000ddc0 << printed by our application after catching SIGILL
flash_erase[8557]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
nandwrite[8561]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
awk[4448]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
awk[16002]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
getStats[20670]: unhandled signal 4 at 0fecfdc0 nip 0fecfdc0 lr 0fecfdbc code 30001
expr[27923]: unhandled signal 4 at 0fe74dc0 nip 0fe74dc0 lr 0fe74dc0 code 30001
Edit 2: Another lead is that the corruption is always occurring at physical frame number 0x00a4d. I suppose with PAGE_SIZE of 4096 this translates to physical address of 0x00A4DDC0. We are suspecting couple of our kernel drivers and investigating further. Is there any better idea (like putting hardware watchpoint) which could be more efficient? How about KASAN as suggested below?
Any help is appreciated. Thanks.
1.) Text segment is RO, but the permissions could be changed by mprotect, you can check that if you think it is possible
2.) If it is kernel problem:
Run kernel with KASAN and KUBSAN (undefined behaviour) sanitizers
Focus on drivers code not included in mainline
The hint here is one byte corruption. Maybe i'm wrong, but it means that DMA is not to blame. It looks like some kind of invalid store.
3.) Hardware. I think, your problem looks like a hardware problem (RAM issue).
You can try to decrease RAM system frequency in bootloader
Check if this problem reproduces on stable mainline software, that is how you can prove that it's it

Smart card GENERAL AUTHENTICATE fails under VMWare

Trying to work with a smart card reader from a VMWare Player virtual machine, which is Windows 8.1 AMD64. The card is a US gov't issued PIV card, as described in the respective NIST standard. The host is Windows 7 AMD64.
I'm working with WinSCard API. VERIFY and GET DATA commands work as expected. However, when I perform GENERAL AUTHENTICATE to generate a digital signature, SCardTransmit() returns error code 1, and in the debug output there are messages:
First-chance exception at 0x77675B68 (KernelBase.dll) in PIVTool.exe: 0x00000001: Incorrect function.
First-chance exception at 0x77675B68 (KernelBase.dll) in PIVTool.exe: 0x0000071A: The remote procedure call was canceled, or if a call time-out was specified, the call timed out.
First-chance exception at 0x77675B68 in PIVTool.exe: Microsoft C++ exception: unsigned long at memory location 0x0113E48C.
And in the system log, there are some messages to that effect too:
Smart Card Service, ID 610: Smart Card Reader 'VMware Virtual USB CCID 0' rejected IOCTL TRANSMIT: Incorrect function. If this error persists, your smart card or reader may not be functioning correctly.
Command Header: 00 87 07 9c
The command header matches what I transmit.
WudfUsbccidDrv ID 11: A Request has returned failure.
MsgType: 0x80
ICCStatus: 0x0
CmdStatus: 0x1
Error: 0x0
SW1: 0x0
SW2: 0x0
and then
WudfUsbccidDrv ID 1: An operation has failed (0x0, 0x0, 0x0, 0x0).
ScT1Transmit: Failed to send request.
HResult: The specified request is not a valid operation for the target device.
and also
WudfUsbccidDrv ID 10: Request [ 0 ] (CLS=0x0,INS=0x87,P1=0x7,P2=0x9C,Lc=266,Le=0,.NETServiceMethod=0x0)
again, that's exactly my request.
The very same code works as expected on the host machine. Same card, same physical reader, same command. The card driver might be different.
I've tried an equivalent piece of Java code against the SunPCSC security provider, just to check for subtle protocol failures; it fails on the VM with a similar message:
javax.smartcardio.CardException: sun.security.smartcardio.PCSCException: Unknown error 0x1
Looks like the VMWare's virtualization layer for smart cards doesn't like this particular command. Any ideas, please?

gdb - identify the reason of the segmentation fault

I have a server/client system that runs well on my machines. But it core dumps at one of the users machine (OS: Centos 5). Since I don't have access to the user's machine so I built a debug mode binary and asked the user to try it. The crash did happened again after around 2 days of running. And he sent me the core dump file. Loading the core dump file with gdb, it did shows the crash location but I don't understand the reason (sorry, my previous experience is mostly with Windows. I don't have much experience with Linux/gdb). I would like have your input. Thanks!
1. the /var/log/messages at the user's machine shows the segfault:
Jan 16 09:20:39 LPZ08945 kernel: LSystem[4688]: segfault at 0000000000000000 rip 00000000080e6433 rsp 00000000f2afd4e0 error 4
This message indicates that there is a segfault at instruction pointer 80e6433 and stack pointer f2afd4e0. Looks that the program tries to read/write at address 0.
2. load the core dump file into gdb and it shows the crash location:
$gdb LSystem core.19009
GNU gdb (GDB) CentOS (7.0.1-45.el5.centos)
... (many lines of outputs from gdb omitted)
Core was generated by `./LSystem'.
Program terminated with signal 11,
Segmentation fault.
'#0' 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214
214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
gdb says the crash occurs at Line 214?
3. Frame information. (at Frame #0)
(gdb) info frame
Stack level 0, frame at 0xf2afd7e0:
eip = 0x80e6433 in CLClient::connectToServer (liccomm/LClient.cpp:214); saved eip 0x80e6701
called by frame at 0xf2afd820
source language c++.
Arglist at 0xf2afd7d8, args: this=0xf2afd898, conn=11
Locals at 0xf2afd7d8, Previous frame's sp is 0xf2afd7e0
Saved registers:
ebx at 0xf2afd7cc, ebp at 0xf2afd7d8, esi at 0xf2afd7d0, edi at 0xf2afd7d4, eip at 0xf2afd7dc
The frame is at f2afd7e0, why it's different than the rsp from Part 1, which is f2afd4e0? I guess the user may have provided me with mismatched core dump file (whose pid is 19009) and /var/log/messages file (which indicates a pid 4688).
4. The source
(gdb) list +
209
210 //pHost is declared as struct hostent* and 'pHost = gethostbyname(serverAddress);'
211 memset( &a4, 0, sizeof(a4) );
212 a4.sin_family = AF_INET;
213 a4.sin_port = htons( nPort );
214 memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
215
216 aalen = sizeof(a4);
217 aa = (struct sockaddr *)&a4;
I could not see anything wrong with Line 214. And this part of the code must ran many times during the runtime of 2 days.
5. The variables
Since gdb indicated that Line 214 was the culprit. I printed everything.
memcpy((char *) & (a4.sin_addr), pHost->h_addr, pHost->h_length);
(gdb) print a4.sin_addr
$1 = {s_addr = 0}
(gdb) print &(a4.sin_addr)
$2 = (in_addr *) 0xf2afd794
(gdb) print pHost->h_addr_list[0]
$3 = 0xa24af30 "\202}\204\250"
(gdb) print pHost->h_length
$4 = 4
(gdb) print memcpy
$5 = {} 0x2fcf90
So I basically printed everything that's at Line 214. ('pHost->h_addr_list[0]' is 'pHost->h_addr' due to '#define h_addr h_addr_list[0]')
I was not able to catch anything wrong. Did you catch anything fishy? Is it possible the memory has been corrupted somewhere else? I appreciate your help!
[edited] 6. back trace
(gdb) bt
'#0' 0x080e6433 in CLClient::connectToServer (this=0xf2afd898, conn=11) at liccomm/LClient.cpp:214
'#1' 0x080e6701 in CLClient::connectToLMServer (this=0xf2afd898) at liccomm/LClient.cpp:121
... (Frames 2~7 omitted, not relevant)
'#8' 0x080937f2 in handleConnectionStarter (par=0xf3563f98) at LManager.cpp:166
'#9' 0xf7f5fb41 in ?? ()
'#10' 0xf3563f98 in ?? ()
'#11' 0xf2aff31c in ?? ()
'#12' 0x00000000 in ?? ()
I followed the nested calls. They are correct.
The problem with the memcpy is that the source location is not of the same type than the destination.
You should use inet_addr to convert addresses from string to binary
a4.sin_addr = inet_addr(pHost->h_addr);
The previous code may not work depending on the implementation (some my return struct in_addr, others will return unsigned long, but the principle is the same.