The process running the following code crashes with a Segmentation fault:
#include <stdlib.h>
#include <iostream>
#include <pthread.h>

void* f( void* )
{
    while( true )
    {
        // It crashes inside this call (with cerr, too).
        std::cout << 0;
    }
    return NULL;
}

int main()
{
    pthread_t t;
    pthread_create( &t, NULL, &f, NULL );
    while( true )
    {
        // It crashes with any script/app; true is just simple.
        system( "true" );
    }
    return 0;
}
It crashes about every other execution, within a few seconds (the output contains anywhere from thousands to millions of '0's). With the code above it crashes a few functions deep inside the cout << 0 call; depending on extra functions called or data put on the stack in f(), it crashes in different places. In gdb, the stack sometimes makes no sense with regard to the order of the function calls. From this I deduce that the stack is corrupted.
I found there are some problems with multi-threaded applications calling fork() (see also two of the comments mentioning stack corruption). A forked child inherits the parent's file descriptors, and they stay open across the subsequent exec() unless they are marked FD_CLOEXEC. However, there are no explicitly created file descriptors here. (I tried setting FD_CLOEXEC on fileno( stdout ) and fileno( stderr ) with no positive change.)
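Roughly what that attempt looked like (a sketch; the helper function is mine, not the original code):

#include <fcntl.h>
#include <stdio.h>

// Mark a stdio stream's descriptor close-on-exec, so the child spawned
// by system() closes it across its exec() of /bin/sh.
static void set_cloexec( FILE* stream )
{
    int fd = fileno( stream );
    int flags = fcntl( fd, F_GETFD, 0 );
    if( flags != -1 )
        fcntl( fd, F_SETFD, flags | FD_CLOEXEC );
}

// e.g. set_cloexec( stdout ); set_cloexec( stderr );  (no effect on the crash here)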
Even without explicitly created file descriptors, can I not mix threads and fork()? Do I simply need to replace the system() call with equivalent functionality? Or is there a bug in the kernel that causes this crash and was fixed after 2.6.30?
Other Details
I am running it on an ARM AT91 processor (armv5tejl) with Linux 2.6.30 (with some overlays and patches for my specific set of peripherals) compiled with GCC 4.3.2.
Linux 2.6.30 #1 Thu May 29 15:43:04 CDT 2014 armv5tejl GNU/Linux
I had been cross-compiling it with -g and -O0, but it still crashes without those flags:
arm-atmel-linux-gnueabi-g++ -o system_thread system_thread.cpp -lpthread
I've also tried the -fstack-protector-all flag: Sometimes it crashes in __stack_chk_fail(), but sometimes other function pointers or data get corrupted and it crashes earlier.
The libraries it loads (from strace):
libpthread.so.0
libstdc++.so.6
libm.so.6
libgcc_s.so.1
libc.so.6
Note: Since it sometimes does not crash and is not really responsive to ^C, I typically run it in the background:
$ killall -9 system_thread; rm -f log; system_thread >log &
I have compiled this program for a few different architectures and Linux kernel versions, but I have not seen it crash anywhere else:
Linux 3.10.29 #1 Wed Feb 12 17:12:39 CST 2014 armv5tejl GNU/Linux
Linux 3.6.0-dirty #3 Wed May 28 13:53:56 CDT 2014 microblaze GNU/Linux
Linux 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 GNU/Linux
Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
EDIT: Note that on the same architecture (armv5tejl) it does not crash with Linux 3.10.29. It also does not crash when running on an earlier version of my "appliance" (older server and client applications) with the same Linux version, 2.6.30. So the OS environment has some effect.
BusyBox v1.20.1 provides the sh that system() calls.
This is reproducible on an ARM processor using the 2.6.30 kernel you mentioned, but not in master, so we can use git bisect to find where this bug was fixed (it took about 16 iterations). Note that git bisect is meant to find regressions; here master is "good" and a past version is "bad", so the meanings of "good" and "bad" must be reversed, as sketched below.
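A sketch of the reversed bisection (the tag is illustrative; each step means building and booting the candidate kernel and running the test program):

$ git bisect start
$ git bisect bad master       # works here, but labeled "bad" (reversed)
$ git bisect good v2.6.30     # crashes here, but labeled "good" (reversed)
# ...then, after testing each kernel git checks out:
$ git bisect bad              # if the test program no longer crashes
$ git bisect good             # if it still crashes

With the labels reversed, the "first bad commit" git reports is the commit that fixed the bug.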
The culprit found by the bisection is this commit, which fixes "an instance of userspace data corruption" involving fork(). That symptom is very similar to the one you describe, and could also corrupt memory outside of the stack. After backporting this commit and its required parent to the 2.6.30 kernel, the code you posted no longer crashes.
Related
I am writing software that must install and run on as many flavors of Linux as possible, but I must be able to compile it on a single Jenkins slave.
Right now this mostly works, but I have run into a case where a special combination of things produces a segfault on Debian 10 and on none of my numerous other supported flavors of Linux. I have been able to reproduce this in 3 different apps (some of which have been working for years), including the simplified prototype listed below.
// g++ -g -o ttt -static tt.cpp
// Compiled like this on CentOS 6 (g++ 4.7), this code segfaults when run
// on Debian 10, but on none of the dozens of other tested Linux flavors.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int _argc, char* _argv[])
{
    srand(time(0));
    printf("success\n");
    return 0;
}
What I have found by running each of my 3 apps on Debian 10 under gdb is that they segfault under the following conditions:
They must be compiled with the -static flag. Without -static, they work fine regardless of which flavor they are compiled on.
They must call the time() function. It doesn't really matter how I call it; I tried the usual suspects like passing a NULL pointer and passing a real pointer. It always segfaults when the app is compiled statically.
They must be compiled on CentOS 6 and run on Debian 10. If I compile statically on Debian 10, the prototype works fine.
So here are the constraints I am working under.
I have to compile on one Linux slave, because I am distributing just one binary. Keeping track of multiple binaries and which one goes on what Linux flavor is not really an option.
I have to compile statically or it creates incompatibilities on other supported flavors of Linux.
I have to use an older Linux flavor, also for the sake of compatibility. It doesn't have to be Centos, but it has to produce binaries that will run on Centos, as well as numerous other flavors.
I also have to use g++ 4.7, for code compatibility.
In your answers, I am hoping for some kind of code trick. Maybe a good, reliable replacement for the time() function. Or a suggestion for another flavor of Linux that is compatible with Debian 10.
Bonus points go to whoever can explain the black magic of why a basic, ubiquitous function like time() is completely compatible with Debian 9 but segfaults on Debian 10, and ONLY when compiled statically on CentOS 6...
EDIT:
strace on the Centos 6 server:
execve("./ttt", ["./ttt"], [/* 37 vars */]) = 0
uname({sys="Linux", node="testcent6", ...}) = 0
brk(0) = 0x238c000
brk(0x238d180) = 0x238d180
arch_prctl(ARCH_SET_FS, 0x238c860) = 0
brk(0x23ae180) = 0x23ae180
brk(0x23af000) = 0x23af000
gettimeofday({1585687633, 358976}, NULL) = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f682c82f000
write(1, "success\n", 8success
) = 8
exit_group(0) = ?
+++ exited with 0 +++
strace on the Debian 10 server:
execve("./ttt", ["./ttt"], 0x7fff0430dfd0 /* 18 vars */) = 0
uname({sysname="Linux", nodename="deletemedebian10", ...}) = 0
brk(NULL) = 0x1f6f000
brk(0x1f70180) = 0x1f70180
arch_prctl(ARCH_SET_FS, 0x1f6f860) = 0
brk(0x1f91180) = 0x1f91180
brk(0x1f92000) = 0x1f92000
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffffffffff600400} ---
+++ killed by SIGSEGV +++
Segmentation fault
The executable is trying to use the vsyscall interface to implement the system call behind the time() function.
This interface has long been deprecated in favor of the vdso. It was dropped completely a while back, but it can still be emulated.
Debian 10 appears to ship with vsyscall emulation disabled. This is done for security reasons: the fixed vsyscall addresses can make attacks easier. If it is an option for you, you should be able to re-enable the emulation by booting the kernel with the command line option vsyscall=emulate, accepting the mentioned security repercussions.
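For illustration only (x86-64 Linux; my sketch of what the old statically linked glibc effectively does, not the asker's code): time() is reached through the fixed legacy vsyscall address 0xffffffffff600400, which is exactly the si_addr in the strace output above.

#include <cstdio>
#include <ctime>

int main()
{
    // Fixed legacy vsyscall entry for time(). With vsyscall=none the
    // page is unusable and this call faults with SEGV_MAPERR; with
    // vsyscall=emulate the kernel traps and emulates the call.
    typedef time_t (*legacy_time_fn)(time_t*);
    legacy_time_fn vsyscall_time =
        reinterpret_cast<legacy_time_fn>(0xffffffffff600400ULL);

    std::printf("%ld\n", static_cast<long>(vsyscall_time(NULL)));
    return 0;
}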
The glibc version on CentOS 6 seems to be 2.12, which is too old to make use of the vdso. To compile a binary compatible with newer kernel configurations, you need at least glibc 2.14 instead. I don't know whether that can easily be installed on CentOS, or whether it will work correctly with the kernel shipped with it.
You should also consider whether you really need a fully static binary. You could link everything statically except libc, as sketched below.
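For example, a possible middle ground (these are standard GCC flags; whether the resulting binary meets all the compatibility constraints here still needs testing):

g++ -g -o ttt -static-libgcc -static-libstdc++ tt.cpp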
I'm using mapped_file_source from the boost::iostreams namespace to read from a large file in chunks:
#include <cstdio>
#include <cstdlib>
#include <boost/iostreams/device/mapped_file.hpp>

boost::iostreams::mapped_file_source read_bytes(const char *file_path,
                                                unsigned long long int offset,
                                                unsigned long long int length) {
    boost::iostreams::mapped_file_params parameters;
    parameters.path = file_path;
    parameters.length = static_cast<size_t>(length);
    parameters.flags = boost::iostreams::mapped_file::mapmode::readonly;
    parameters.offset = static_cast<boost::iostreams::stream_offset>(offset);

    boost::iostreams::mapped_file_source file;
    file.open(parameters);
    if (file.is_open()) {
        return file;
    } else {
        printf("Failed to open file\n");
        exit(EXIT_FAILURE);
    }
}
My code works fine on Ubuntu in WSL (Windows Subsystem for Linux), but when I compile and run it on Windows, the second file.open call causes the process to exit with exit code 3.
Reading file in 5 parts
Processing chunk 1/5
Processing chunk 2/5
Process finished with exit code 3
No error message is printed and no exception is thrown. The documentation suggests exit code 3 means ERROR_PATH_NOT_FOUND, but that makes no sense here.
I debugged the binaries on both platforms, and all variables are exactly identical; the only exceptions are the Unix-style versus Windows-style file paths, the allocated addresses, and the system time values, so no memory corruption seems to have occurred. I do not understand why this doesn't work on Windows when it should behave identically.
I'm using MinGW to compile for Windows and gcc 8.2 in Ubuntu.
"C:\Program Files\mingw-w64\x86_64-8.1.0-win32-seh-rt_v6-rev0\mingw64\bin\x86_64-w64-mingw32-gcc.exe" --version
x86_64-w64-mingw32-gcc.exe (x86_64-win32-seh-rev0, Built by MinGW-W64 project) 8.1.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
If I read the file in a single go, it works fine (!). I'm aligning all offsets to multiples of the page size. The mapped_file_source is automatically closed when it goes out of scope, so it's not a "file already open" issue (that would actually cause an exception from Boost).
With MSVC the problem can no longer be reproduced. In general, Microsoft's own compiler may be more reliable on Windows than the likes of MinGW, especially since I was using an "unofficial" toolchain.
Lately, at inconsistent intervals, I've seen one particular program hang (that is, halt execution without crashing and without spinning the CPU). When I force termination with a core dump, it's consistently stuck on this line:
int new_socket = accept4(listen_socket,NULL,NULL,SOCK_NONBLOCK);
Since this is a non-blocking accept, how can the program hang there? Operating conditions don't appear to change dramatically between functional and halted executions.
I am no network programming expert, so please let me know what other source (if any) would provide context for tracking this down.
EDIT: This software is running on and compiled with the following
$ uname -a
Linux phoenix 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ g++ --version
g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
$ ldd --version
ldd (Ubuntu EGLIBC 2.19-0ubuntu6.6) 2.19
The SOCK_NONBLOCK option does not apply to the listening socket (the one passed to accept4()), but to the socket created when a connection is accepted.
Verbatim from man accept4:
SOCK_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description.
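If the goal is for accept4() itself never to block, the listening socket has to be made non-blocking as well. A minimal sketch (the helper function is mine, not the asker's code):

#define _GNU_SOURCE   // for accept4(); g++ defines this by default
#include <fcntl.h>
#include <sys/socket.h>

// Returns a non-blocking accepted socket, or -1 with errno set
// (EAGAIN/EWOULDBLOCK when no connection is pending).
static int accept_nonblocking(int listen_fd)
{
    // Mark the *listening* socket non-blocking so accept4() returns
    // immediately instead of waiting for a connection.
    int flags = fcntl(listen_fd, F_GETFL, 0);
    if (flags != -1)
        fcntl(listen_fd, F_SETFL, flags | O_NONBLOCK);

    // SOCK_NONBLOCK only makes the *accepted* socket non-blocking.
    return accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK);
}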
I have compiled a cpp file with this command line: g++ -g test.cpp
It throws an exception at line 28. I want to investigate the cause by inspecting the variables in lldb, so I set a breakpoint at line 28 and run a.out in lldb.
(lldb) n
Process 84233 stopped
* thread #1: tid = 0xa44b86, 0x00000001000017fb a.out`say(s=<unavailable>) + 987 at so.cpp:28, queue = 'com.apple.main-thread', stop reason = step over
frame #0: 0x00000001000017fb a.out`say(s=<unavailable>) + 987 at so.cpp:28
25 }
26 else{
27 s.insert(0, to_string(sz));
-> 28 s.erase(2, sz-1);
29 }
30 return s;
31 }
(lldb) po s
error: Couldn't materialize: couldn't get the value of variable s: variable not available
Errored out in Execute, couldn't PrepareToExecuteJITExpression
Why the error message? How can I inspect the variable s?
lldb version: lldb-320.4.115.3
g++ version: Configured with: --prefix=/Applications/Xcode6-Beta5.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 6.0 (clang-600.0.45.3) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
That error means the debug information does mention the variable, but says it has no storage location at the current PC.
That can be because the variable got optimized out (unlikely given you are just calling a function on the variable) or because the compiler flubbed the debug info for the variable and lost track of where it went.
Make sure you are compiling the code you are trying to debug at -O0; few compilers emit good debug information at higher optimization levels. If you are already compiling at -O0, this is a compiler bug, and you should probably report it to the gcc folks; you could also see whether you have better luck with clang. Otherwise, you have to read the assembly of the function to figure out where the variable actually lives, then tell the debugger to print the appropriately cast address (e.g. p *(std::string *)<address> once you have found it).
I had this problem when I enabled "Address Sanitizer" in my app scheme. Disabling it fixed the issue.
I see this when I run a RELEASE (vs. a DEBUG) build. To fix it, go to Product -> Scheme... -> Edit Scheme... -> Info, then set Build Configuration to "Debug".
I had this issue when compiling with the flag -Og. For some reason I thought that meant "optimize for debugging", but that doesn't seem to be the case in practice. Removing the flag fixed the issue for me.
I am using 3 HP-UX PA-RISC machines for testing. My binary fails on one PA-RISC machine whereas it works on the others. Note that even when the binary is executed with just a version check (i.e. it only prints the version and exits, performing no other operation), it still gives a segmentation fault. What could be the probable reason? It is important to me to find the root cause of the failure on that one box. Since the program works on two of the HP-UX machines, could it be an environment issue?
I tried copying the same piece of code (declare the variables, print the version, and exit) into a test program built with the same compilation options, but that works. Here is the gdb output for the failing program.
$ gdb prg_us
Detected 64-bit executable.
Invoking /opt/langtools/bin/gdb64.
HP gdb 5.4.0 for PA-RISC 2.0 (wide), HP-UX 11.00
and target hppa2.0w-hp-hpux11.00.
Copyright 1986 - 2001 Free Software Foundation, Inc.
Hewlett-Packard Wildebeest 5.4.0 (based on GDB) is covered by the
GNU General Public License. Type "show copying" to see the conditions to
change it and/or distribute copies. Type "show warranty" for warranty/support.
..
(gdb) b 5573
Breakpoint 1 at 0x4000000000259e04: file pmgreader.c, line 5573 from /tmp/test/prg_us.
(gdb) r -v
Starting program: /tmp/test/prg_us -v
Breakpoint 1, main (argc=2, argv=0x800003ffbfff05f8) at pmgreader.c:5573
5573 if (argc ==2 && strcmp (argv[1], "-v") == 0)
Current language: auto; currently c++
(gdb) n
5575 printf ("%s", VER);
(gdb) n
5576 exit(0);
(gdb) n
Program received signal SIGSEGV, Segmentation fault
si_code: 0 - SEGV_UNKNOWN - Unknown Error.
0x800003ffbfb9e130 in real_free+0x480 () from /lib/pa20_64/libc.2
(gdb)
What could be the probable cause? Why does it work on two machines but not on the third?
Just a long shot - are you including both stdio.h and stdlib.h so the prototypes for printf() and exit() are known to the compiler?
Actually, after a bit more thought (and noticing that C++ is in the mix), you may have some static object initialization causing problems (possibly corrupting the heap?).
Unfortunately, it looks like valgrind is not supported on PA-RISC - is there some similar tool on PA-RISC you can run? If not, it might be worthwhile running valgrind on an x64 build of your program if it's not too difficult to set that up.
Michael Burr already hinted at the problem: it's a global object.
Notice that the crash happens inside a free routine (real_free in libc). That indicates a memory deallocation, and in turn a destructor. This makes sense in context: destructors of global objects run after exit(0). A stack trace will show more detail.
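A minimal sketch of this failure mode (the global object and the overflow are hypothetical, purely for illustration): heap corruption caused earlier in the program only surfaces inside free() after exit(0), which matches the real_free+0x480 frame in the gdb output above.

#include <cstdio>
#include <cstdlib>

struct Global {
    char* buf;
    Global()  { buf = static_cast<char*>(std::malloc(16)); }
    ~Global() { std::free(buf); }  // runs after exit(0), via the atexit machinery
};

static Global g;

int main()
{
    g.buf[24] = 'x';               // hypothetical out-of-bounds write that
                                   // may clobber allocator metadata
    std::printf("%s", "version 1.0\n");
    std::exit(0);                  // the crash surfaces later, in ~Global -> free()
}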