Hung program on a non-blocking socket - c++

At irregular intervals, one particular program has recently started hanging (that is, it stops making progress but neither crashes nor spins the CPU). When I force termination with a core dump, it is consistently stuck on this line:
int new_socket = accept4(listen_socket,NULL,NULL,SOCK_NONBLOCK);
Since this is a non-blocking accept, how can the program hang up there? It doesn't appear as if operating conditions change dramatically between functional and halted execution.
I am no network programming expert, so please let me know what other source (if any) would provide context for tracking this down.
EDIT: This software is compiled with and running on the following:
$ uname -a
Linux phoenix 3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:43:14 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ g++ --version
g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
$ ldd --version
ldd (Ubuntu EGLIBC 2.19-0ubuntu6.6) 2.19

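The SOCK_NONBLOCK flag does not apply to the listening socket (the one passed to accept4()) but to the new socket created when a connection is accepted. Unless the listening socket itself has O_NONBLOCK set, accept4() still blocks until a connection arrives.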
Verbatim from man accept4:
SOCK_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description.
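A minimal sketch of the fix this implies: set O_NONBLOCK on the listening socket itself, so that accept4() returns -1 with EAGAIN/EWOULDBLOCK when no connection is pending. The helper names set_nonblocking, make_listener and try_accept are hypothetical, and the loopback listener on an ephemeral port is only an assumption to keep the example self-contained.

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: put an existing descriptor into non-blocking mode.
// Returns 0 on success, -1 on error (errno is set by fcntl).
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

// Hypothetical helper: TCP listener on loopback with a kernel-chosen port.
// Returns the listening descriptor, or -1 on error.
int make_listener() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) return -1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // 0 lets the kernel pick a free port
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) == -1 ||
        listen(fd, 16) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

// Same call as in the question. SOCK_NONBLOCK only affects the accepted
// socket; whether this call blocks depends on listen_socket's own flags.
// accept4() is Linux-specific (g++ defines _GNU_SOURCE by default).
int try_accept(int listen_socket) {
    return accept4(listen_socket, nullptr, nullptr, SOCK_NONBLOCK);
}
```

With set_nonblocking(listen_socket) applied once after listen(), a failed try_accept() with errno equal to EAGAIN or EWOULDBLOCK simply means "nothing to accept yet", and the caller can go back to poll()/select()/epoll. Passing SOCK_NONBLOCK to socket() when creating the listener should have the same effect as the fcntl() call.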

Speed up extremely slow MinGW-w64 compilation/linking?

How can I speed up MinGW-w64's extremely slow C++ compilation/linking?
Compiling a trivial "Hello World" program:
#include <iostream>
int main()
{
    std::cout << "hello world" << std::endl;
}
...takes 3 minutes(!) on this otherwise-unloaded Windows 10 box (i7-6700, 32GB of RAM, decent SATA SSD):
> ptime.exe g++ main.cpp
ptime 1.0 for Win32, Freeware - http://www.pc-tools.net/
Copyright(C) 2002, Jem Berkes <jberkes@pc-tools.net>
=== g++ main.cpp ===
Execution time: 180.488 s
Process Explorer shows the g++ process tree bottoming out in ld.exe which doesn't use any appreciable CPU or I/O for the duration.
Running the g++ process tree through API Monitor shows there are three unusually long syscalls in ld.exe: two NtCreateFile()s and a NtOpenFile(), each operating on a.exe and taking 60 seconds apiece.
The slowness only happens when using the default a.exe output; g++ -o foo.exe main.cpp takes 2 seconds, tops.
"Well don't use a.exe as an output name then!" isn't really a solution since this behavior causes CMake to take ages doing compiler feature detection.
GCC toolchain versions:
>g++ --version
g++ (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
>ld --version
GNU ld (GNU Binutils) 2.30
I couldn't reproduce the problem in a clean Windows 10 VM. That, together with the dependence on the output filename, led me to suspect anti-virus/anti-malware interference.
fltmc instances listed several possible filesystem filter drivers; guess-n-check narrowed it down to two of Carbon Black's: carbonblackk & ParityDriver.
Disabling them with Regedit, by setting Start to 0x4 ("Disabled"; 0x2 == Automatic, 0x3 == Manual) in these two registry keys and then rebooting, fixed the slowness:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\carbonblackk
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ParityDriver

Debugging with Gdb in RISCV (spike: unrecognized option --gdb-port)

After building the RISC-V tools and GCC (cloned from lowRISC, using isa-sim and not riscv-tools), I'm stuck at the debugging-with-GDB phase.
In the second terminal, target remote in gdb times out.
In the first terminal, when I run spike --gdb-port 9824 pk tests/debug or spike --gdb-port 9824 pk hello.c, it prints:
spike: unrecognized option --gdb-port
usage: spike [host options] <target program> [target options]
Host Options:
-p <n> Simulate <n> processors
-m <n> Provide <n> MB of target memory
-d Interactive debug mode
-g Track histogram of PCs
-h Print this help message
--ic=<S>:<W>:<B> Instantiate a cache model with S sets,
--dc=<S>:<W>:<B> W ways, and B-byte blocks (with S and
--l2=<S>:<W>:<B> B both powers of 2).
--extension=<name> Specify RoCC Extension
--extlib=<name> Shared library to load
I don't know whether this has to do with configuring gdb on its own, or whether gdb was built and configured when I ran ./build.sh for the RISC-V tools.
If not, could you please correct the --gdb-port command? (I'm new to Linux.) I've tried --gdb-port=9824 and --gdb-port:9824, with the same result.
Thank you
The message spike: unrecognized option --gdb-port means that spike, not gdb, does not recognize the option. Spike comes from riscv-isa-sim, not from riscv-tools, and the lowRISC variant of Spike (https://github.com/lowRISC/riscv-isa-sim) is many commits behind master:
This branch is 3 commits ahead, 172 commits behind riscv:master.
Latest commit e220bc4 on May 19, 2016, by wsong83: Merge commit '0d084d5' into update
One of the commits that was not ported added gdb support to spike in https://github.com/riscv/riscv-isa-sim (documented at https://github.com/riscv/riscv-isa-sim#debugging-with-gdb), but it has not been pulled into https://github.com/lowRISC/riscv-isa-sim (and is not documented there). The gdb-related commits date from Oct 2016, Jun 2016, and May 2016, and the --gdb-port option was added in d1d8863086c57f04236418f21ef8a7fbfc184b0b (Mar 19, 2016), https://github.com/riscv/riscv-isa-sim/commit/d1d8863086c57f04236418f21ef8a7fbfc184b0b:
+ fprintf(stderr, " --gdb-port=<port> Listen on <port> for gdb to connect\n");
+ parser.option(0, "gdb-port", 1, [&](const char* s){gdb_port = atoi(s);});
You can try merging the changes between the two isa-sim repositories, ask the lowRISC authors to merge them, or simply use spike from riscv.
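Why the older build rejects the flag can be illustrated with a minimal sketch of callback-based option registration, modelled loosely on the parser.option(...) line in the diff above. TinyOptionParser is a hypothetical stand-in, not spike's actual parser class, and it only handles the --name=value form. An option is recognized only if some code registered it; a build without the commit never registers gdb-port.

```cpp
#include <cstdlib>
#include <cstring>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a callback-based option parser.
// This is NOT spike's real class; it only illustrates the mechanism.
class TinyOptionParser {
public:
    // Register a long option name together with the action to run on its value.
    void option(const std::string& name, std::function<void(const char*)> action) {
        actions_[name] = std::move(action);
    }

    // Handle one argument of the form --name=value.
    // Returns false when the argument is malformed or the name was never
    // registered (the "unrecognized option" case).
    bool parse(const char* arg) {
        if (std::strncmp(arg, "--", 2) != 0) return false;
        const char* eq = std::strchr(arg, '=');
        if (eq == nullptr) return false;
        std::string name(arg + 2, eq);  // text between "--" and "="
        auto it = actions_.find(name);
        if (it == actions_.end()) return false;
        it->second(eq + 1);  // pass the value after "=" to the callback
        return true;
    }

private:
    std::map<std::string, std::function<void(const char*)>> actions_;
};
```

A parser on which option("gdb-port", ...) was called with a callback like the one in the diff accepts "--gdb-port=9824" and stores 9824; a parser without that registration returns false for the same argument, however the value is spelled.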

Valgrind: fatal error: memcheck.h: No such file or directory

We are trying to track down a "Conditional jump or move depends on uninitialised value" finding reported by Valgrind in a C++ project. The address provided in the finding is not really helpful because it points to the end of a GCC extended assembly block, not to the actual variable causing the trouble.
According to the blog article "Eliminating undefined values with Valgrind, the easy way", we can use VALGRIND_CHECK_MEM_IS_DEFINED or VALGRIND_CHECK_VALUE_IS_DEFINED after including <memcheck.h>. Additionally, those macros or functions are apparently documented in the header file (there is definitely no man page for them).
However, when I include <memcheck.h> or <valgrind/memcheck.h>, it results in:
fatal error: memcheck.h: No such file or directory
Based on Stack Overflow's How do I find which rpm package supplies a file I'm looking for?, I performed an RPM file search, but it returns 0 hits for memcheck.h.
QUESTIONS
The blog article is a bit dated. Does the information still apply?
If the information is accurate, then where do I find memcheck.h?
$ uname -a
Linux localhost.localdomain 4.1.4-200.fc22.x86_64 #1 SMP Tue Aug 4 03:22:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 5.1.1 20150618 (Red Hat 5.1.1-4)
...
$ valgrind --version
valgrind-3.10.1
You have to install the valgrind-devel RPM, which contains memcheck.h.
The *-devel packages are typically located in the "optional" repositories (e.g. rhel-x86_64-server-optional-6 on RHEL 6). Alternatively, you can find the RPM on Google, download it, and install it on its own. With either approach, memcheck.h is typically placed in /usr/include/valgrind once installed, so it is included as <valgrind/memcheck.h>.
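Once the header is available, a check along the lines the question describes might look like the sketch below. The struct Sample and the helper looks_defined are hypothetical names for illustration. The __has_include guard and the no-op fallback macro are assumptions added only so that the file still compiles on a machine without valgrind-devel; with the real header, VALGRIND_CHECK_MEM_IS_DEFINED evaluates to 0 when every byte is defined (and also when the program is not running under Valgrind), and otherwise makes Memcheck print an error and yields the address of the first offending byte.

```cpp
#include <cstddef>

// Use the real header when it is installed (e.g. via valgrind-devel).
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define HAVE_MEMCHECK_H 1
#  endif
#endif

// Assumption for this sketch only: a no-op fallback so the code still builds
// where memcheck.h is missing. It always reports "defined".
#ifndef HAVE_MEMCHECK_H
#  define VALGRIND_CHECK_MEM_IS_DEFINED(addr, len) ((void)(addr), (void)(len), 0)
#endif

// Hypothetical example type.
struct Sample {
    int a;
    int b;
};

// Hypothetical helper: true when Memcheck finds no undefined byte in
// [p, p + len). Outside Valgrind this is always true.
bool looks_defined(const void* p, std::size_t len) {
    return VALGRIND_CHECK_MEM_IS_DEFINED(p, len) == 0;
}
```

Placing calls such as looks_defined(&s, sizeof s) just before the suspicious code, and running the program under valgrind, should narrow down which object is uninitialised even when the reported address points into an assembly block.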
Another way to dig into an uninitialised value error with Valgrind is to use the embedded gdbserver.
You can then put breakpoints in your program and interactively check the definedness of various addresses/lengths using memcheck monitor commands such as:
check_memory [addressable|defined] <addr> [<len>]
check that <len> (or 1) bytes at <addr> have the given accessibility
and outputs a description of <addr>
See e.g. http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.monitor-commands for more information.

Threads looping system() and cout corrupt the stack

The process running the following code crashes with a Segmentation fault:
#include <stdlib.h>
#include <iostream>
#include <pthread.h>
void* f( void* )
{
    while( true )
    {
        // It crashes inside this call (with cerr, too).
        std::cout << 0;
    }
    return NULL;
}
int main()
{
    pthread_t t;
    pthread_create( &t, NULL, &f, NULL );
    while( true )
    {
        // It crashes with any script/app; true is just simple.
        system( "true" );
    }
    return 0;
}
It crashes about every other execution within a few seconds (output has anywhere from thousands to millions of '0's). It crashes a few functions deep in the cout << 0 call with the above code. Depending on extra functions called or data put on the stack in f(), it crashes in different places. In gdb, sometimes the stack doesn't make sense with regard to the order of the function calls. From this I deduce the stack is corrupted.
I found there are some problems with multi-threaded applications calling fork() (see also two of the comments mentioning stack corruption). Forking/cloning a process copies the file descriptors if they aren't set to FD_CLOEXEC. However, there are no explicitly created file descriptors. (I tried setting FD_CLOEXEC on fileno( stdout ) and fileno( stderr ) with no positive change.)
Even without explicit file descriptors can I not mix threads and fork()? Do I simply need to replace the system() call with equivalent functionality? Or is there a bug in the kernel that causes this crash and has been fixed after 2.6.30?
Other Details
I am running it on an ARM AT91 processor (armv5tejl) with Linux 2.6.30 (with some overlays and patches for my specific set of peripherals) compiled with GCC 4.3.2.
Linux 2.6.30 #1 Thu May 29 15:43:04 CDT 2014 armv5tejl GNU/Linux
I had been [cross] compiling it with -g and -O0, but without those it still crashes:
arm-atmel-linux-gnueabi-g++ -o system_thread system_thread.cpp -lpthread
I've also tried the -fstack-protector-all flag: Sometimes it crashes in __stack_chk_fail(), but sometimes other function pointers or data get corrupted and it crashes earlier.
The libraries it loads (from strace):
libpthread.so.0
libstdc++.so.6
libm.so.6
libgcc_s.so.1
libc.so.6
Note: Since it sometimes does not crash and is not really responsive to ^C, I typically run it in the background:
$ killall -9 system_thread; rm -f log; system_thread >log &
I have compiled this program for a few different architectures and Linux kernel versions, but I have not seen it crash anywhere else:
Linux 3.10.29 #1 Wed Feb 12 17:12:39 CST 2014 armv5tejl GNU/Linux
Linux 3.6.0-dirty #3 Wed May 28 13:53:56 CDT 2014 microblaze GNU/Linux
Linux 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 GNU/Linux
Linux 3.8.0-35-generic #50~precise1-Ubuntu SMP Wed Dec 4 17:25:51 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
EDIT: Note that on the same architecture (armv5tejl) it does not crash with Linux 3.10.29. Also, it does not crash when running on an earlier version of my "appliance" (older server and client applications), having the same version of Linux - 2.6.30. So the environment of the OS has some effect.
BusyBox v1.20.1 provides sh that system() calls.
This is reproducible on an ARM processor using the 2.6.30 kernel that you mentioned, but not in master. We can use git bisect to find where this bug was fixed (it took about 16 iterations). Note that git bisect is meant to find regressions, whereas here master is "good" and a past version is "bad", so we need to reverse the meanings of "good" and "bad".
The commit found by the bisection is this commit, which fixes "an instance of userspace data corruption" involving fork(). That symptom is very similar to the one you describe, and it could also corrupt memory outside of the stack. After backporting this commit and the required parent to the 2.6.30 kernel, the code you posted no longer crashes.

ZeroMQ C++ multi-threaded server example runtime error

I'm trying to run the ZeroMQ multithreaded C++ server example, which builds fine with
$ g++ server.cpp -lpthread -lzmq -o server -Wall
I'm using OS X 10.6.5, gcc version 4.2.1 (Apple Inc. build 5664), and zeromq2's latest master branch (Dec 1st). However, I'm getting a runtime error immediately after starting the server (with ./server):
terminate called after throwing an instance of 'zmq::error_t'
what(): Operation not supported by device
Is the code provided on the blog no longer current? Or have I misconfigured something? ZMQ seems to be working fine for me otherwise on this machine (simple request/reply socket patterns).
Ridiculous. "tcp://localhost:5555" will fail, but "tcp://127.0.0.1:5555" works fine.
Update 1:
/etc/hosts has an entry for localhost, so I don't believe that's the problem. I've also tried using tcp://lo:5555, with no success.
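The workaround of using the numeric address can be applied in one place with a small helper like the sketch below. numeric_loopback_endpoint is a hypothetical function, not part of ZeroMQ; it only rewrites the literal prefix tcp://localhost: and leaves every other endpoint untouched. It does not explain why the hostname form fails, it just avoids it.

```cpp
#include <string>

// Hypothetical helper (not part of ZeroMQ): rewrite "tcp://localhost:PORT"
// to "tcp://127.0.0.1:PORT"; any other endpoint is returned unchanged.
std::string numeric_loopback_endpoint(const std::string& endpoint) {
    const std::string from = "tcp://localhost:";
    const std::string to = "tcp://127.0.0.1:";
    if (endpoint.compare(0, from.size(), from) == 0) {
        return to + endpoint.substr(from.size());
    }
    return endpoint;
}
```

Assuming the C++ binding's bind() takes a C string, as in the server example, the result can be passed as numeric_loopback_endpoint("tcp://localhost:5555").c_str().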