Sigsegv when recompiled for arm architecture - c++

I am trying to figure out where I made a mistake in my c++ poco code. While running it on Ubuntu 14, the program runs correctly but when recompiled for the arm via gnueabi, it just crashes with sigsegv:
This is report from the stack trace (where it falls):
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(8888), sin_addr=inet_addr("192.168.2.101")}, 16) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x6502a8c4} ---
+++ killed by SIGSEGV +++
And this is code where it falls ( it should connect to the tcp server ):
this->address = SocketAddress(this->host, (uint16_t)this->port);
this->socket = StreamSocket(this->address); // !HERE
Note that I am catching any exceptions (like econnrefused) and it dies correctly when it can't connect. When it connect's to the server side, it just falls.
When trying to start valgrind, it aborts with error. No idea what shadow memory range means
==4929== Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
http://pastebin.com/Ky4RynQc here is full log
Thank you

Don't know why, this compiled badly on ubuntu, but when compiled on fedora (same script, same build settings, same gnu), it's working.
Thank you guys for your comments.

Related

Error calling the kernel: error code is: -6 on openCL kernel

my question is little different than that being asked on this platform, but I think it is related to programming.
I am using vitis and alveo accelerators. They basically have a host program( written in C++) and than a openCL file, which is also kind of C programming.
The host asks the kernel to do some thing by putting the input data through arguments and than results are send back to the host.
There are 3 ways to execute the data on kernel( Software emulation, HW emulation and Hardware). Generally, we start with Software emulation and than with Hardware. I have always seen that when code gets build and run with software emulation, it also builds and run on Hardware.
But I am getting a strange error as below:
INFO: Reading /home/m992c693/Workspace C2Q TC _paper/qht_3d_system/Hardware/vector addition.xclbin
Loading: '/home/m992c693/Workspace C20 TC_paper/qht_3d_system/Hardware/vector_addition.xclbin
Trying to program device[6]: xilinx_u250_gen3x16_xdma_shell_4 |
Device[6]: program successful
-./src/host.cpp:585 Error calling krnt_c2q =
XRT build version: 2.13.466
Build hash: 5505e402c2calffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:45:07
Git branch: 2022.1
PID: 243276
UID: 100586768
[Tue Sep 27 17:35:28 2622 GMT!
HOST: alveo
EXE: /home/m992c693/Workspace C20 TC_paper/qht_3d/Hardware/qht_3d
[XRT] ERROR: No such compute unit ‘c2q:c2q 1': Invalid argument
:Kernel (program, “c2q", Serr), error code is: -6
It would be great if someone can guide me why I am getting this error and what kind of debugging procedure I should follow.

Can't open a serial port in g++ on Ubuntu

I am trying to run the sample programs that came with a development lidar unit (RPLIDAR A1M8 360 Degree Laser Scanner Kit). The sample code for a Linux target compiles without error using g++
(Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 however when I run it, it stops with the message 'Error, cannot bind to the specified serial port /dev/ttyUSB0'. Using gdb I can trace the code down to the serial port open call to __libc_open64 where I don't have source anymore...
__libc_open64 (file=0x5555557832a9 "/dev/ttyUSB0", oflag=2306) at ../sysdeps/unix/sysv/linux/open64.c:36}
36 ../sysdeps/unix/sysv/linux/open64.c: No such file or directory.
Here is what I tried so far to eliminate obvious failure modes:
The serial port, a USB-to-serial converter on ttyUSB0, works just fine from Putty (115200, 8, none, 1, software flow control only). I connect it to a Raspberry Pi as the console to get a large amount of data without issue and can bidirectionally interact with the console. Therefore I conclude the port and converter work fine
The section of code that calls __libc_open64 is...
bool raw_serial::open(const char * portname, uint32_t baudrate, uint32_t flags)
{
if (isOpened()) close();
serial_fd = ::open(portname, O_RDWR | O_NOCTTY | O_NDELAY);
I looked up the constants that are ORed together and hard coded the value (04403) just in case there was some issue with the version of the header files. Oddly the value is off by 1 from the oflag value in the gdb line. Compiled and ran it, no difference
I verified the call to ::open returned -1 which is treated as a failure in the code immediately after
I can see in dmesg that ttyUSB0 is open and available
I am not a c++ guy. This looks to me like an issue with the g++ __libc_open64 code but that also seems very unlikely. I don't know where to go next. Any advice would be greatly appreciated.
The first comments received pointed to permissions. I chmoded /dev/ttyUSB0 wide open before starting this exercise.
I ran strace and see the line...
openat(AT_FDCWD, "/dev/ttyUSB0", O_ACCMODE|O_NOCTTY|O_NONBLOCK) = -1 ENOENT (No such file or directory)
Well, that's embarrassing! It returns Permission Denied! Yes, I chmoded permissions wide open, then promptly forgot that being a USB device, it goes away and gets redefined when I unplug/plug it in again. Thank you for your help!
The prediction that strace would show "Permission denied" was correct. I was forgetting this is not a fixed serial port but rather a USB-to-serial converter. Even though I chmoded the permissions and verified it use with Putty, I forgot that as soon as I rebooted, or unplugged the USB, the /dev/ttyUSB0 device goes away and is recreated again when I rebooted or plugged it in, requiring that I set the permissions again.

Debugging a nasty SIGILL crash: Text Segment corruption

Ours is a PowerPC based embedded system running Linux. We are encountering a random SIGILL crash which is seen for wide variety of applications. The root-cause for the crash is zeroing out of the instruction to be executed. This indicates corruption of the text segment residing in memory. As the text segment is loaded read-only, the application cannot corrupt it. So I am suspecting some common sub-system (DMA?) causing this corruption. Since the problem takes days to reproduce (crash due to SIGILL) it is getting difficult to investigate. So to begin with I want to be able to know if and when the text segment of any application has been corrupted.
I have looked at the stack trace and all the pointers, registers are proper.
Do you guys have any suggestions how I can go about it?
Some Info:
Linux 3.12.19-rt30 #1 SMP Fri Mar 11 01:31:24 IST 2016 ppc64 GNU/Linux
(gdb) bt
0 0x10457dc0 in xxx
Disassembly output:
=> 0x10457dc0 <+80>: mr r1,r11
0x10457dc4 <+84>: blr
Instruction expected at address 0x10457dc0: 0x7d615b78
Instruction found after catching SIGILL 0x10457dc0: 0x00000000
(gdb) maintenance info sections
0x10006c60->0x106cecac at 0x00006c60: .text ALLOC LOAD READONLY CODE HAS_CONTENTS
Expected (from the application binary):
(gdb) x /32 0x10457da0
0x10457da0 : 0x913e0000 0x4bff4f5d 0x397f0020 0x800b0004
0x10457db0 : 0x83abfff4 0x83cbfff8 0x7c0803a6 0x83ebfffc
0x10457dc0 : 0x7d615b78 0x4e800020 0x7c7d1b78 0x7fc3f378
0x10457dd0 : 0x4bcd8be5 0x7fa3eb78 0x4857e109 0x9421fff0
Actual (after handling SIGILL and dumping nearby memory locations):
Faulting instruction address: 0x10457dc0
0x10457da0 : 0x913E0000
0x10457db0 : 0x83ABFFF4
=> 0x10457dc0 : 0x00000000
0x10457dd0 : 0x4BCD8BE5
0x10457de0 : 0x93E1000C
Edit:
One lead that we have is that the corruption is always occurring at an offset that ends with 0xdc0.
For e.g.
Faulting instruction address: 0x10653dc0 << printed by our application after catching SIGILL
Faulting instruction address: 0x1000ddc0 << printed by our application after catching SIGILL
flash_erase[8557]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
nandwrite[8561]: unhandled signal 4 at 0fed6dc0 nip 0fed6dc0 lr 0fed6dac code 30001
awk[4448]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
awk[16002]: unhandled signal 4 at 0fe09dc0 nip 0fe09dc0 lr 0fe09dbc code 30001
getStats[20670]: unhandled signal 4 at 0fecfdc0 nip 0fecfdc0 lr 0fecfdbc code 30001
expr[27923]: unhandled signal 4 at 0fe74dc0 nip 0fe74dc0 lr 0fe74dc0 code 30001
Edit 2: Another lead is that the corruption is always occurring at physical frame number 0x00a4d. I suppose with PAGE_SIZE of 4096 this translates to physical address of 0x00A4DDC0. We are suspecting couple of our kernel drivers and investigating further. Is there any better idea (like putting hardware watchpoint) which could be more efficient? How about KASAN as suggested below?
Any help is appreciated. Thanks.
1.) Text segment is RO, but the permissions could be changed by mprotect, you can check that if you think it is possible
2.) If it is kernel problem:
Run kernel with KASAN and KUBSAN (undefined behaviour) sanitizers
Focus on drivers code not included in mainline
The hint here is one byte corruption. Maybe i'm wrong, but it means that DMA is not to blame. It looks like some kind of invalid store.
3.) Hardware. I think, your problem looks like a hardware problem (RAM issue).
You can try to decrease RAM system frequency in bootloader
Check if this problem reproduces on stable mainline software, that is how you can prove that it's it

Segmentation fault using GstDiscoverer (GStreamer)

I am writing desktop app for windows on C++ using Qt for GUI and GStreamer for audio processing.
In my app I need to monitor several internet aac audio streams if they are online, and listen to available stream that has the most priority. For this task I use GstDiscoverer object from GStreamer, but I have some problems with it.
I check audio streams every 1-2 seconds, so GstDiscoverer is called very often.
And every time my app is running, eventually it will crash with segmentation fault error during GstDiscoverer check.
I tried both sync and async methods of calling GstDiscoverer ( gst_discoverer_discover_uri(), gst_discoverer_discover_uri_async() ) , both work the same way.
The crash happens in aac_type_find() function from gsttypefindfunctions.c on line 1122 (second line of code below).
len = ((c.data[offset + 3] & 0x03) << 11) |
(c.data[offset + 4] << 3) | ((c.data[offset + 5] & 0xe0) >> 5);
Local variables received from debugger during one of crashes:
As we can see, offset variable is greater than c.size, so c.data[offset] is out of range, I think that's why segmentation fault happens.
This happens not regularly. The program can work several hours or ten minutes.
But it seems to me that it happens more often if time interval between calls of GstDiscoverer is little. So, there is some probability of crash calling aac_type_find().
I tried GStreamer versions 1.6.1 and latest 1.6.2, the bug exists in both.
Can somebody help me with this problem? Is this Gstreamer bug or maybe I do something wrong?
It was reported to the GStreamer project here and a patch for the crash was merged and will be in the next releases: https://bugzilla.gnome.org/show_bug.cgi?id=759910

C++ program does not process any function call or printf during SIGSEGV in gcc

I am having problem in getting my stack trace output to stderr or dumping to a log file. I am running the code in Kubuntu10.04 with gcc compiler (4.4.3). The issue is that in the normal running mode (without gdb), the program does not output anything except 'Segmentation Fault'. I wish to output the backtrace output as in the print statements below. When I run gdb with my application, it comes to the printf/fprintf/(function call) statement, and then crashes with the following statement:
669 {
(gdb)
670 printf("Testing for stability.\n");
(gdb)
Program received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffff68b1f45 in puts () from /lib/libc.so.6
The strange things is that it works if I call a function within the same file that crashes, it works fine and spews the output properly. But if the program crashes in a function outside this file, it does not print any output.
So no printf or file dumping statement or function call gets processed. I am using the following sample code:
void bt_sighandler(int sig, siginfo_t *info,
void *secret) {
void *trace[16];
char **messages = (char **)NULL;
int i, trace_size = 0;
ucontext_t *uc = (ucontext_t *)secret;
/* Do something useful with siginfo_t */
if (sig == SIGSEGV)
printf("Got signal %d, faulty address is %p, "
"from %p\n", sig, info->si_addr,
uc->uc_mcontext.gregs[0]);
else
printf("Got signal %d#92; \n", sig);
trace_size = backtrace(trace, 16);
/* overwrite sigaction with caller's address */
trace[1] = (void *) uc->uc_mcontext.gregs[0];
messages = backtrace_symbols(trace, trace_size);
/* skip first stack frame (points here) */
printf("[bt] Execution path:#92; \n");
for (i=1; i<trace_size; ++i)
printf("[bt] %s#92; \n", messages[i]);
exit(0);
}
int main() {
/* Install our signal handler */
struct sigaction sa;
sa.sa_sigaction = (void *)bt_sighandler;
sigemptyset (&sa.sa_mask);
sa.sa_flags = SA_RESTART | SA_SIGINFO;
sigaction(SIGSEGV, &sa, NULL);
sigaction(SIGUSR1, &sa, NULL);
/* Do something */
printf("%d#92; \n", func_b());
}
Thanks in advance for any help.
Unfortunately you just can't reliably do much of anything in a SIGSEGV handler. Think about it this way: Your program has a serious error and its state (including system level state such as the heap) is in an inconsistent state.
In such a case, you can't expect the OS to magically fix up the heap and other internals it needs in order to be able to execute arbitrary code within your signal handler.
If the SEGV happens in your own code, the good solution is to use the core and fix the root problem. If the core happens in other code via say a shared library, I'd suggest isolating that code in an entirely separate binary and communicate between the two binaries. Then if the library crashes your main program does not.
You are supposed to do very little in a signal handler, in principle only access variables of type sig_atomic_t and volatile data.
Doing I/O is definitely out of the question. See this page for gcc:
http://www.gnu.org/s/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy
Try using simpler functions, such as strcat() and write().
Is there a reason you can't use valgrind?
When the application crashes Linux creates a core dump with the state of the application when it crashed. The core file can be examined using gdb.
If no core file is created try changing core file size with
ulimit -c unlimited
in the same shell and before the program is started.
The name of the core file is usually core.PID where PID is the pid of the program.
The core file is usually placed somewhere in /tmp or the directory where the program was started.
A lot more info on core files is available on the man page for core. Use
man core
to read the man page.
I managed to get it partially working. Actually I was running the application in 'sudo' mode. Running it in user mode gives me the callstack. However running in user mode disables hardware acceleration (nvidia graphics drivers). To resolve that, I added myself to the 'video' group, so that I have access to /dev/nvidia0 & /dev/nvidiactl. However when I get the access the stack does not get generated anymore. Its only when I am in user mode and hardware acceleration is disabled, the stack is coming. But I can't run my application without hardware acceleration (mean some important functionality would get disabled). Please let me know if anyone has any idea.
Thanks.